II. RELATED APPEALS AND INTERFERENCES

There are no related appeals or interferences. 

III. STATUS OF CLAIMS 

Claims 79-115 are pending in the application. Claims 1-78 have been canceled. Claims
79-115 stand finally rejected, as follows: claims 79-80, 82-88, 91-95, 98, 104, 106 and 108-114
are rejected under 35 U.S.C. §102(e) as being anticipated by U.S. Patent No. 5,699,503 to
Bolosky et al.; claims 81, 89, 96, 100, 105, 107 and 115 are rejected under 35 U.S.C. §103(a) as
being obvious over U.S. Patent No. 5,699,503 to Bolosky et al.; and claims 90, 97, 99 and 101-103
are rejected under 35 U.S.C. §103(a) as being obvious over U.S. Patent No. 5,699,503 to
Bolosky et al. in view of U.S. Patent No. 4,914,570 to Peacock. The present appeal is directed to
claims 79-115.

IV. STATUS OF AMENDMENTS 

Claims 79-115 have not been amended. These claims are reproduced in Appendix A
attached hereto. 

V. SUMMARY OF THE INVENTION 

The present invention is directed to a system and method for providing network
processing and stored data access that is configured to be fully scalable and/or fully survivable.
Two different embodiments of the invention are shown in Figs. 1 and 2 of the application,
attached hereto as Appendix B. As can be seen, at least one server (also referred to as an
application processor) is provided that is operable to process user requests. The server is
connected to a switch, which is in turn connected to at least one data storage device.

In one aspect of the invention, the system is fully "scalable" in the sense that additional 
servers can be added to the system as demand for a particular application increases, without 
adding additional data storage devices. Conversely, servers can be removed from the system
without removing data storage devices. In a similar manner, additional data storage devices can 
be added to the system as storage requirements for a particular application increase, without 
adding additional servers. Conversely, data storage devices can be removed from the system
without removing servers. Thus, the system is scalable to increase or decrease server capacity 
without changing the data storage capacity, and/or is scalable to increase or decrease data storage 
capacity without changing the server capacity. 
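
By way of illustration only, the independence of server capacity and data storage capacity described above can be sketched in a few lines of Python. The `Switch` class, its method names and the device labels are hypothetical and do not appear in the application; the sketch merely assumes that every server reaches every data storage device through the switch.

```python
class Switch:
    """Hypothetical switch through which any server reaches any data storage device."""

    def __init__(self):
        self.servers = []
        self.storage_devices = []

    def add_server(self, name):
        # Adding a server does not touch the list of storage devices.
        self.servers.append(name)

    def remove_server(self, name):
        self.servers.remove(name)

    def add_storage_device(self, name):
        # Adding a storage device does not touch the list of servers.
        self.storage_devices.append(name)

    def remove_storage_device(self, name):
        self.storage_devices.remove(name)

    def paths(self):
        """List every (server, storage device) pair connected through the switch."""
        return [(s, d) for s in self.servers for d in self.storage_devices]


switch = Switch()
switch.add_server("server-1")
switch.add_storage_device("storage-1")

# Demand for the application increases: add a server, no new storage device.
switch.add_server("server-2")
```

In this sketch the number of storage devices remains one after the second server is added, and both servers can reach it through the switch.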

In another aspect of the invention, the system includes at least two servers that apply 
substantially the same application(s) when processing user requests, and at least two data storage 
devices that contain substantially identical data. The system is fully "survivable" in the sense 
that, if any one of the servers fails, user requests can be processed by any of the other servers in 
the system that are operable. Likewise, if any one of the data storage devices fails, substantially 
identical data can be retrieved from any of the other data storage devices that are operable. Thus, 
the system is survivable and able to process user requests in the event of a failure of any one of 
the servers, and is survivable and able to retrieve data in the event of a failure of any one of the 
data storage devices. 
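
The survivability behavior described above can likewise be sketched, purely as an illustration. The `Node` class, the helper functions and the sample data below are hypothetical; the sketch assumes that all servers apply the same application and that all data storage devices hold the same data, so that a single dictionary can stand in for the replicated data.

```python
class Node:
    """Hypothetical stand-in for a server or a data storage device."""

    def __init__(self, name):
        self.name = name
        self.operable = True


def first_operable(nodes):
    """Return the first node that has not failed."""
    for node in nodes:
        if node.operable:
            return node
    raise RuntimeError("no operable node remains")


def handle_request(servers, storage_devices, replicated_data, key):
    """Process a request on any operable server using any operable storage device."""
    server = first_operable(servers)
    device = first_operable(storage_devices)
    # Every device is assumed to hold the same data, so the choice of device
    # does not change the value returned.
    return server.name, device.name, replicated_data[key]


servers = [Node("server-1"), Node("server-2")]
storage_devices = [Node("storage-1"), Node("storage-2")]
replicated_data = {"greeting": "hello"}

# Simulate the failure of one server and one data storage device.
servers[0].operable = False
storage_devices[0].operable = False

result = handle_request(servers, storage_devices, replicated_data, "greeting")
```

Under these assumptions the request is still served, here by the second server and the second data storage device.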

Claims 79-85, 89-97, 102, 104-108, 110 and 112 are directed to the scalability aspect of
the claimed invention. Claims 98-101 and 114-115 are directed to the survivability aspect of the
claimed invention. Claims 86-88, 103, 109, 111 and 113 are directed to both the scalability and
survivability aspects of the claimed invention.

VI. ISSUES 

The issues on appeal are as follows: 

A. Whether claims 79-80, 82-88, 91-95, 98, 104, 106 and 108-114 are unpatentable
under 35 U.S.C. §102(e) as being anticipated by U.S. Patent No. 5,699,503 to Bolosky et al. 

B. Whether claims 81, 89, 96, 100, 105, 107 and 115 are unpatentable under 35
U.S.C. §103(a) as being obvious over U.S. Patent No. 5,699,503 to Bolosky et al.

C. Whether claims 90, 97, 99 and 101-103 are unpatentable under 35 U.S.C. §103(a) 
as being obvious over U.S. Patent No. 5,699,503 to Bolosky et al. in view of U.S. Patent No. 
4,914,570 to Peacock. 

VII. GROUPING OF THE CLAIMS 

With respect to the rejection stated in Issue A, claims 79-80, 82-85, 91-95, 104, 106, 108,
110 and 112 stand or fall together; claims 98 and 114 stand or fall together; and claims 86-88,
109, 111 and 113 stand or fall together. As discussed in Section VIII.A.4 below, these three
different groups of claims are separately patentable.

With respect to the rejection stated in Issue B, claims 81, 89, 96, 105 and 107 stand or fall 
together; and claims 100 and 115 stand or fall together. As discussed in Section VIII.B.3 below,
these two different groups of claims are separately patentable. 

With respect to the rejection stated in Issue C, claims 90, 97, and 102 stand or fall 
together; claims 99 and 101 stand or fall together; and claim 103 stands or falls alone. As 
discussed in Section VIII.C.3 below, these three different groups of claims are separately
patentable. 

VIII. ARGUMENT
A. Applicant's Claims are not Anticipated by Bolosky

The Examiner rejected claims 79-80, 82-88, 91-95, 98, 104, 106 and 108-114 under 35
U.S.C. §102(e) as being anticipated by U.S. Patent No. 5,699,503 to Bolosky et al. ("Bolosky"),
attached hereto as Appendix C. However, as discussed below, Bolosky does not disclose the
survivability and/or scalability aspects of the claimed invention.
1. The Bolosky Disclosure 

Bolosky discloses a media server system, such as a video-on-demand system, in which
video image sequences (e.g., a movie) are transmitted from the system to subscribers in response
to user requests. The system includes a controller (e.g., controller 16 of Fig. 2) and a plurality of
subsystems (e.g., subsystems 18A, 18B and 18C of Fig. 2). Bolosky, col. 5, l. 61 to col. 6, l. 10.
Each subsystem comprises a single microprocessor (e.g., microprocessor 20A of Fig. 2) and one
or more data storage devices (e.g., data storage devices 22A and 24A of Fig. 2). Bolosky, col. 6,
ll. 11-19. In operation, the controller cooperates with the microprocessor of each of the
subsystems to schedule the transmission of video image sequences stored on the data storage
devices to the subscribers. Bolosky, col. 6, ll. 20-23.

The video image sequences are stored on the data storage devices of all of the subsystems
by dividing them into sequential blocks of data and "striping" them across the primary portions
of the data storage devices. Bolosky, col. 6, ll. 40-43. "Striping" refers to the method in which a
first block of data is stored on a first data storage device and each sequentially following block of
data is stored on the next sequential data storage device. Bolosky, col. 6, ll. 46-49. When
reaching the last data storage device, the next block of data wraps around and is stored on the
first data storage device of the first subsystem. Bolosky, col. 6, ll. 49-51. This continues until all
the blocks of data are stored across all of the data storage devices. Bolosky, col. 6, ll. 51-53.
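
The striping scheme summarized above can be illustrated with a minimal Python sketch. The function name, the six device labels and the eight block labels are hypothetical and are not taken from Bolosky; the sketch assumes only the round-robin, wrap-around placement described at col. 6, ll. 46-53.

```python
def stripe_blocks(blocks, devices):
    """Assign sequential blocks to devices in order, wrapping after the last device."""
    layout = {device: [] for device in devices}
    for index, block in enumerate(blocks):
        # The modulo implements the wrap-around to the first device.
        device = devices[index % len(devices)]
        layout[device].append(block)
    return layout


devices = ["SD1", "SD2", "SD3", "SD4", "SD5", "SD6"]
blocks = ["A", "B", "C", "D", "E", "F", "G", "H"]
layout = stripe_blocks(blocks, devices)
```

In this sketch each block is stored on exactly one device, so no two devices hold the same block.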

After the blocks of data are stored on the primary portions of the data storage devices,
"declustered mirroring" is used to store a copy of the same data on the secondary portions of the
data storage devices. Bolosky, col. 6, ll. 57-61. An example of "declustered mirroring" is shown
in Fig. 3D. In this example, a block of data stored on the primary portion of a data storage
device of a first subsystem (e.g., "Subsystem 1, SD1, Block A") is divided into first and second
sub-blocks of data, wherein the first sub-block of data is stored on the secondary portion of a
data storage device of a second subsystem (e.g., "Subsystem 2, SD3, Sub-Block A1") and the
second sub-block of data is stored on the secondary portion of a data storage device of a third
subsystem (e.g., "Subsystem 3, SD5, Sub-Block A2"). Bolosky, col. 9, ll. 14-34. In the event
of a failure of the first subsystem, the first and second sub-blocks of data can be transmitted from
the second and third subsystems to the subscribers. Id.
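
The single example described above (one block on a first subsystem, mirrored as two sub-blocks on a second and a third subsystem) can be sketched as follows. The function names, the byte-string block and the rule of placing sub-blocks on the next subsystems in order are assumptions made for illustration; the sketch does not attempt to reproduce the full layout of Fig. 3D.

```python
def decluster(block_data, home, subsystems, declustering_number=2):
    """Split a block into sub-blocks and place each on a subsystem other than home.

    Assumes declustering_number is smaller than the number of subsystems.
    """
    size = -(-len(block_data) // declustering_number)  # ceiling division
    start = subsystems.index(home)
    placement = []
    for k in range(declustering_number):
        target = subsystems[(start + 1 + k) % len(subsystems)]
        placement.append((target, block_data[k * size:(k + 1) * size]))
    return placement


def read_after_failure(placement, failed):
    """Reassemble the block from its mirror sub-blocks after the home subsystem fails."""
    if any(target == failed for target, _ in placement):
        raise ValueError("a mirror sub-block resides on the failed subsystem")
    return b"".join(piece for _, piece in placement)


subsystems = ["Subsystem 1", "Subsystem 2", "Subsystem 3"]
placement = decluster(b"ABCDEFGH", "Subsystem 1", subsystems)
recovered = read_after_failure(placement, "Subsystem 1")
```

Under these assumptions the block is recoverable only by combining the sub-blocks held by two different subsystems; no single surviving subsystem holds the whole block.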

2. Bolosky Does Not Disclose the Scalability Aspect of the Claimed 
Invention 

Independent claims 79, 82, 91, 104 and 106 (and dependent claims 80, 83-88, 92-95 and
108-113), which are directed to the scalability aspect of the claimed invention,^1 each require that
a server operates independently of a data storage device so as to permit the addition (or removal)
of a server without the addition (or removal) of a data storage device (e.g., as demand for a
particular application increases or decreases). Bolosky does not disclose or suggest this
limitation. Rather, Bolosky discloses multiple subsystems that each include a microprocessor
tied to one or more data storage devices. Within each subsystem, the microprocessor does not
operate independently of the data storage devices. As such, it is not possible to: (1) add a new
microprocessor without also adding one or more new data storage devices to create a new
subsystem; or (2) remove a microprocessor of a subsystem without also removing the data
storage devices of the subsystem.

^1 Dependent claims 86-88, 109, 111 and 113 are also directed to the survivability aspect of the claimed invention,
discussed in Section VIII.A.3 below.

The Examiner repeatedly cites three different portions of the Bolosky specification to
support his contention that Bolosky discloses a system that is fully scalable. However, none of
these portions discloses a server that operates independently of a data storage device to permit the
addition (or removal) of a server without the addition (or removal) of a data storage device, as
claimed by Applicant:

1. Bolosky, col. 8, l. 38 to col. 9, l. 34

This portion of the Bolosky specification discloses that a declustering number 
(i.e., the number of sub-blocks of data stored across multiple data storage devices) 
can be chosen so as to tolerate the failure of more than one data storage device or 
subsystem. An alternative embodiment of "declustered mirroring" is also 
disclosed, wherein the burden of performing failure mode processing is spread 
across a larger number of data storage devices than in the preferred embodiment. 

2. Bolosky, col. 5, l. 61 to col. 6, l. 23

This portion of the Bolosky specification discloses that, although the preferred 
embodiment describes three subsystems, a larger number of subsystems will 
typically be employed. This portion also discloses that while each subsystem of 
the preferred embodiment includes a single microprocessor that is responsible for 
controlling two data storage devices, each subsystem may alternatively include 
one data storage device or more than two data storage devices. 

3. Bolosky, col. 7, ll. 4-28

This portion of the Bolosky specification discloses that the declustering number 
may vary, and that a higher declustering number can be chosen to: (1) lessen the 
burden of failure mode processing by any one data storage device, and (2) reduce 
the bandwidth reserved for failure mode processing. 
All of the above portions of the Bolosky specification are directed to the system's
capability of varying the number of data storage devices within a subsystem to enhance fault 
tolerance. Nowhere does Bolosky disclose a microprocessor that operates independently of a 
data storage device to permit the addition (or removal) of a microprocessor without the addition 
(or removal) of a data storage device. Rather, in Bolosky, each microprocessor is tied to 
particular data storage devices (albeit the number of data storage devices may vary). 

Thus, independent claims 79, 82, 91, 104 and 106 (and dependent claims 80, 83-88,
92-95 and 108-113) are not anticipated by Bolosky and should be allowed.

3. Bolosky Does Not Disclose the Survivability Aspect of the Claimed 
Invention 

Independent claims 98 and 114 (and dependent claims 86-88, 109, 111 and 113), which
are directed to the survivability aspect of the claimed invention,^2 each require first and second (or
a plurality of) data storage devices that each store substantially the same data such that, in the
event of a failure of any one of the data storage devices, the data is accessible from any other of
the data storage devices that are operable. Bolosky does not disclose or suggest this limitation.
Rather, in Bolosky, each data storage device stores different blocks of data such that no one data
storage device stores substantially the same data as any other data storage device.

^2 Dependent claims 86-88, 109, 111 and 113 are also directed to the scalability aspect of the claimed invention,
discussed in Section VIII.A.2 above.

The Examiner argues that the "declustered mirroring" process of the Bolosky system
discloses this limitation. Not true. As can be seen in Fig. 3D, each data storage device stores a
completely different set of data: SD1 stores Block A and Sub-Blocks I1 and G2, SD2 stores
Block B and Sub-Blocks F1 and D2, etc. As such, the Bolosky process of storing different
blocks of data on each data storage device is directly contrary to the claimed invention.

Thus, independent claims 98 and 114 (and dependent claims 86-88, 99-101, 103, 109,
111, 113 and 115) are not anticipated by Bolosky and should be allowed.

4. Different Groups of Claims are Separately Patentable 

Claims 79-80, 82-85, 91-95, 104, 106, 108, 110 and 112 are directed to the scalability
aspect of the claimed invention, and are not anticipated by Bolosky for the reasons discussed in
Section VIII.A.2 above. Claims 98 and 114 are directed to the survivability aspect of the
claimed invention, and are not anticipated by Bolosky for the reasons discussed in Section
VIII.A.3 above. Claims 86-88, 109, 111 and 113 are directed to both the scalability and
survivability aspects of the claimed invention, and are not anticipated by Bolosky for the reasons
discussed in Sections VIII.A.2 and VIII.A.3 above.

B. Applicant's Claims are not Obvious Over Bolosky 

The Examiner rejected claims 81, 89, 96, 100, 105, 107 and 115 under 35 U.S.C. §103(a)
as being obvious over Bolosky (described in Section VIII.A.1 above). A prima facie case of
obviousness for rejecting these claims has not been established. The cited reference does not
disclose or suggest Applicant's claimed invention. The Patent and Trademark Office's burden of
establishing a prima facie case of obviousness is not met unless "'the teachings from the prior art
itself would appear to have suggested the claimed subject matter to a person of ordinary skill in
the art.'" In re Bell, 26 U.S.P.Q. 2d 1529, 1531 (Fed. Cir. 1993) (quoting In re Rinehart, 189
U.S.P.Q. 143, 147 (C.C.P.A. 1976)).

1. Bolosky Does Not Disclose or Suggest the Scalability Aspect of the 
Claimed Invention 

Dependent claims 81, 89, 96, 105 and 107 are directed to the scalability aspect of the
claimed invention. Claim 81 depends from independent claim 79; claim 89 depends from
independent claim 82; claim 96 depends from independent claim 91; claim 105 depends from
independent claim 104; and claim 107 depends from independent claim 106. Each of these
independent claims requires that a server operates independently of a data storage device so as to
permit the addition (or removal) of a server without the addition (or removal) of a data storage
device. Bolosky does not disclose or suggest this limitation. Rather, as discussed in Section
VIII.A.2 above, Bolosky discloses multiple subsystems that each include a microprocessor tied
to one or more data storage devices. Thus, because the Examiner has failed to meet his burden
of establishing a prima facie case of obviousness, claims 81, 89, 96, 105 and 107 should be
allowed.

2. Bolosky Does Not Disclose or Suggest the Survivability Aspect of the 
Claimed Invention 

Dependent claims 100 and 115 are directed to the survivability aspect of the claimed
invention. Claim 100 depends from independent claim 98; and claim 115 depends from
independent claim 114. Each of these independent claims requires first and second (or a plurality
of) data storage devices that each store substantially the same data such that, in the event of a
failure of any one of the data storage devices, the data is accessible from any other of the data
storage devices that are operable. Bolosky does not disclose or suggest this limitation. Rather,
as discussed in Section VIII.A.3 above, Bolosky discloses data storage devices that each store
different blocks of data such that no one data storage device stores substantially the same data as
any other data storage device. Thus, because the Examiner has failed to meet his burden of
establishing a prima facie case of obviousness, claims 100 and 115 should be allowed.



Serial No.: 09/021,466 
Docket No.: 1177 



3. Different Groups of Claims are Separately Patentable 

Claims 81, 89, 96, 105 and 107 are directed to the scalability aspect of the claimed
invention, and are not obvious over Bolosky for the reasons discussed in Section VIII.B.1 above.
Claims 100 and 115 are directed to the survivability aspect of the claimed invention, and are not
obvious over Bolosky for the reasons discussed in Section VIII.B.2 above.

C. Applicant's Claims are not Obvious Over Bolosky in View of Peacock
The Examiner rejected claims 90, 97, 99 and 101-103 under 35 U.S.C. §103(a) as being
obvious over Bolosky (described in Section VIII.A.1 above) in view of U.S. Patent No.
4,914,570 to Peacock ("Peacock"), attached hereto as Appendix D. Peacock discloses a multiple
processor computer system. A prima facie case of obviousness for rejecting these claims has
not been established. The cited references do not disclose or suggest Applicant's claimed
invention. Furthermore, these cited references are not properly combinable. Still further, even if
these cited references are combined, they do not disclose or suggest Applicant's claimed
invention. The Patent and Trademark Office's burden of establishing a prima facie case of
obviousness is not met unless "'the teachings from the prior art itself would appear to have
suggested the claimed subject matter to a person of ordinary skill in the art.'" In re Bell, 26
U.S.P.Q. 2d 1529, 1531 (Fed. Cir. 1993) (quoting In re Rinehart, 189 U.S.P.Q. 143, 147 (C.C.P.A.
1976)).

1. Bolosky and Peacock Do Not Disclose or Suggest the Scalability
Aspect of the Claimed Invention and are not Properly Combinable

Independent claim 102 and dependent claims 90, 97 and 103 are directed to the
scalability aspect of the claimed invention.^3 Claim 90 depends from independent claim 82; claim
97 depends from independent claim 91; and claim 103 depends from independent claim 102.
Independent claim 102 and independent claims 82 and 91 each require that a server operates
independently of a data storage device so as to permit the addition (or removal) of a server
without the addition (or removal) of a data storage device. Neither Bolosky nor Peacock discloses
or suggests this limitation. Rather, as discussed in Section VIII.A.2 above, Bolosky discloses
multiple subsystems that each include a microprocessor tied to one or more data storage devices.
Peacock merely discloses a multiple processor computer system in which each of the processors
has its own associated memory, but also has access to the memories of the other processors.
Thus, Applicant's claimed invention is clearly distinguishable from Bolosky and Peacock.

^3 Dependent claim 103 is also directed to the survivability aspect of the claimed invention, discussed in Section
VIII.C.2 below.

Furthermore, "[b]efore the PTO may combine the disclosures of two or more prior 
art references in order to establish prima facie obviousness, there must be some suggestion for 
doing so, found either in the references themselves or in the knowledge generally available to 
one of ordinary skill in the art." In re Jones, 21 U.S.P.Q. 2d 1941, 1943-44 (Fed. Cir. 1992). If
there is no technological motivation for modifying a reference, then the reference should not be 
part of a §103 rejection. 

There is no motivation to combine Bolosky and Peacock. Bolosky discloses the 
various components of a media server system, such as a video-on-demand system. By contrast,
Peacock discloses the inner workings of a multiple processor computer system. Nothing in
either reference suggests that any one of the various components of Bolosky could be modified 
in accordance with the teachings of Peacock. 

Still further, even if they are combined, the combination of Bolosky and Peacock
does not disclose or suggest Applicant's claimed invention. Specifically, the combination does
not disclose or suggest a server that operates independently of a data storage device so as to
permit the addition (or removal) of a server without the addition (or removal) of a data storage
device, as claimed by Applicant.

Thus, because the Examiner has failed to meet his burden of establishing a prima facie 
case of obviousness, claims 90, 97, 102 and 103 should be allowed. 

2. Bolosky and Peacock Do Not Disclose or Suggest the Survivability 
Aspect of the Claimed Invention and are not Properly Combinable 

Dependent claims 99, 101 and 103 are directed to the survivability aspect of the claimed
invention.^4 Claims 99 and 101 depend from independent claim 98; and claim 103 depends from
independent claim 102. Independent claims 98 and 102 each require a plurality of data storage
devices that each store substantially the same data such that, in the event of a failure of any one
of the data storage devices, the data is accessible from any other of the data storage devices that
are operable. Neither Bolosky nor Peacock discloses or suggests this limitation. Rather, as
discussed in Section VIII.A.3 above, Bolosky discloses data storage devices that each store
different blocks of data such that no one data storage device stores substantially the same data as
any other data storage device. Peacock merely discloses a multiple processor computer system
in which each of the processors has its own associated memory. Thus, Applicant's claimed
invention is clearly distinguishable from Bolosky and Peacock.

^4 Dependent claim 103 is also directed to the scalability aspect of the claimed invention, discussed in Section
VIII.C.1 above.

Furthermore, "[b]efore the PTO may combine the disclosures of two or more prior
art references in order to establish prima facie obviousness, there must be some suggestion for
doing so, found either in the references themselves or in the knowledge generally available to
one of ordinary skill in the art." In re Jones, 21 U.S.P.Q. 2d 1941, 1943-44 (Fed. Cir. 1992). If
there is no technological motivation for modifying a reference, then the reference should not be
part of a §103 rejection.

There is no motivation to combine Bolosky and Peacock. Bolosky discloses the 
various components of a media server system, such as a video-on-demand system. By contrast, 
Peacock discloses the inner workings of a multiple processor computer system. Nothing in
either reference suggests that any one of the various components of Bolosky could be modified 
in accordance with the teachings of Peacock. 

Still further, even if they are combined, the combination of Bolosky and Peacock 
does not disclose or suggest Applicant's claimed invention. Specifically, the combination does 
not disclose or suggest a plurality of data storage devices that each store substantially the same
data such that, in the event of a failure of any one of the data storage devices, the data is 
accessible from any other of the data storage devices that are operable, as claimed by Applicant. 

Thus, because the Examiner has failed to meet his burden of establishing a prima facie 
case of obviousness, claims 99, 101 and 103 should be allowed. 

3. Different Groups of Claims are Separately Patentable 
Claims 90, 97 and 102 are directed to the scalability aspect of the claimed invention, and
are not obvious over Bolosky in view of Peacock for the reasons discussed in Section VIII.C.1
above. Claims 99 and 101 are directed to the survivability aspect of the claimed invention, and
are not obvious over Bolosky in view of Peacock for the reasons discussed in Section VIII.C.2
above. Claim 103 is directed to both the scalability and survivability aspects of the claimed
invention, and is not obvious over Bolosky in view of Peacock for the reasons discussed in
Sections VIII.C.1 and VIII.C.2 above.
IX. APPENDICES

Attached hereto are the following Appendices: 
Appendix A - Claims on Appeal 
Appendix B - Figs. 1 and 2 of the Application 
Appendix C - U.S. Patent No. 5,699,503 to Bolosky et al. 
Appendix D - U.S. Patent No. 4,914,570 to Peacock 



X. SUMMARY 

For the foregoing reasons, Applicant respectfully submits that claims 79-115 are
patentable over the cited references and should be allowed. Accordingly, Applicant respectfully
requests that the Board reverse the Examiner's rejections of claims 79-115, and allow claims
79-115.



Respectfully submitted, 

By: [signature]
Judith L. Carlson, Reg. No. 41,904 
STINSON MORRISON HECKER LLP 
1201 Walnut Street, Suite 2800 
P.O. Box 419251 
Kansas City, MO 64141-6251 
Telephone: (816) 842-8600 
Facsimile: (816) 691-3495 
Attorney for Applicant 
APPENDIX A

Claims on Appeal 



The text of the claims on appeal is as follows:

79. A scalable system for providing network processing and stored data access, the system 
comprising: 

(a) a server operative to process user requests; 

(b) a switch operatively connected to the server; 

(c) a data storage device operatively connected to the switch; and 

(d) wherein the server operates independently of the data storage device and is 
connected to the data storage device via the switch in a manner to permit the inclusion of an 
additional server to process other user requests without the inclusion of an additional data storage 
device. 

80. The system of Claim 79 wherein the server operates independently of the data
storage device and is connected to the data storage device via the switch in a manner to permit 
the inclusion of an additional data storage device without the inclusion of an additional server. 

81. The system of Claim 79 wherein the server applies an application to the user
requests, the application selected from the group consisting of: a mail application, a news 
application, a directory application, a content application, a groupware application, and an 
internet protocol (IP) service. 



82. A scalable system for providing network processing and stored data access, the
system comprising:

(a) at least first and second servers operative to process at least first and
second user requests, respectively;

(b) a switch operatively connected to each of the servers; 

(c) a plurality of data storage devices operatively connected to the switch; and 

(d) wherein the servers operate independently of the data storage devices and 
are connected to the data storage devices via the switch in a manner to permit the inclusion of an 
additional server to process at least an additional user request without the inclusion of an 
additional data storage device. 

83. The system of Claim 82 wherein the servers operate independently of the data 
storage devices and are connected to the data storage devices via the switch in a manner to 
permit the inclusion of an additional data storage device without the inclusion of an additional 
server. 

84. The system of Claim 83 wherein the servers operate independently of the data 
storage devices and are connected to the data storage devices via the switch in a manner to 
permit the removal of any one of the plurality of data storage devices without the removal of any 
of the servers. 



85. The system of Claim 82 wherein the servers operate independently of the data 
storage devices and are connected to the data storage devices via the switch in a manner to
permit the removal of any one of the servers without the removal of any of the data storage 
devices. 

86. The system of Claim 82 wherein each of the first and second servers applies an 
application, the application applied by the first server being substantially the same as the 
application applied by the second server such that, in the event of a failure of either of the first 
and second servers, any subsequent user requests will be processed by any other of the servers 
that are operable. 

87. The system of Claim 82 wherein each of the plurality of data storage devices 
stores data, the data stored by each of the plurality of data storage devices being substantially the 
same such that, in the event of a failure of any one of the plurality of data storage devices, the 
data is accessible from any other of the plurality of data storage devices that are operable. 

88. The system of Claim 82 wherein each of the first and second servers applies an 
application, the application applied by the first server being substantially the same as the 
application applied by the second server such that, in the event of a failure of either of the first 
and second servers, any subsequent user requests will be processed by any other of the servers 
that are operable, and wherein each of the plurality of data storage devices stores data, the data 
stored by each of the plurality of data storage devices being substantially the same such that, in
the event of a failure of any one of the plurality of data storage devices, the data is accessible
from any other of the plurality of data storage devices that are operable.

89. The system of Claim 82 wherein each of the at least first and second servers 
applies an application selected from the group consisting of: a mail application, a news 
application, a directory application, a content application, a groupware application, and an 
internet protocol (IP) service. 

90. The system of Claim 82 further comprising a load balancer operatively connected 
to each of the at least first and second servers, the load balancer operative to route an additional 
user request to the one of the at least first and second servers with the least load. 

91. A scalable system for providing network processing and stored data access, the
system comprising: 

(a) at least first and second sets of servers, each of the sets of servers 
comprising at least first and second servers operative to process at least first and second user 
requests, respectively, and wherein each of the sets of servers applies a separate application; 

(b) a switch operatively connected to each of the servers within each of the
sets of servers;

(c) a plurality of data storage devices operatively connected to the switch; and 

(d) wherein the sets of servers operate independently of the data storage
devices and are connected to the data storage devices via the switch in a manner to permit the
inclusion of an additional server to any of the sets of servers to process at least an additional user
request without the inclusion of an additional data storage device.

92. The system of Claim 91 wherein the sets of servers operate independently of the
data storage devices and are connected to the data storage devices via the switch in a manner to 
permit the inclusion of an additional data storage device without the inclusion of an additional 
server to any of the sets of servers. 

93. The system of Claim 92 wherein the sets of servers operate independently of the 
data storage devices and are connected to the data storage devices via the switch in a manner to 
permit the removal of any one of the plurality of data storage devices without the removal of any 
of the servers from any of the sets of servers. 

94. The system of Claim 93 wherein the data stored by any one of the plurality of data 
storage devices is associated with an application applied by any one of the sets of servers. 

95. The system of Claim 91 wherein the sets of servers operate independently of the 
data storage devices and are connected to the data storage devices via the switch in a manner to 
permit the removal of any one of the servers from any one of the sets of servers without the 
removal of any of the plurality of data storage devices. 



96. The system of Claim 91 wherein each of the at least first and second servers of 
any one of the sets of servers applies an application selected from the group consisting of: a mail 
application, a news application, a directory application, a content application, a groupware 
application, and an internet protocol (IP) service. 

97. The system of Claim 91 wherein each of the at least first and second servers of 
any one of the sets of servers applies an application, and wherein the system further comprises a 
load balancer operatively connected to each of the at least first and second servers of each of the 
sets of servers, the load balancer operative to route user requests to the one of the at least first 
and second servers of the sets of servers with the least load for a particular application. 

98. A survivable system for providing network processing and stored data access, the 
system comprising: 

(a) at least first and second servers operative to process at least first and 
second user requests, respectively, 

(b) a switch operatively connected to each of the servers; 

(c) a plurality of data storage devices operatively connected to the switch; 

(d) wherein each of the first and second servers applies an application, the 
application applied by the first server being substantially the same as the application 
applied by the second server such that, in the event of a failure of either of the first and 
second servers, any subsequent user requests will be processed by any other of the 
servers that are operable; and 



(e) wherein each of the plurality of data storage devices stores data, the data
stored by each of the plurality of data storage devices being substantially the same such 
that, in the event of a failure of any one of the plurality of data storage devices, the data is 
accessible from any other of the plurality of data storage devices that are operable. 

99. The system of Claim 98 wherein the data stored by any one of the plurality of data 
storage devices is associated with an application applied by any one of the first and second 
servers. 

100. The system of Claim 98 wherein each of the at least first and second servers 
applies an application selected from the group consisting of: a mail application, a news 
application, a directory application, a content application, a groupware application, and an 
internet protocol (IP) service.

101. The system of Claim 98 further comprising a load balancer operatively connected
to each of the at least first and second servers, the load balancer operative to route user requests 
to the one of the at least first and second servers corresponding to the server with the least load. 

102. A scalable system for providing network processing and stored data access, the 
system comprising: 

(a) at least first and second servers operative to process at least first and 
second user requests, respectively; 

(b) a switch operatively connected to each of the servers; 



(c) a plurality of data storage devices operatively connected to the switch; 

(d) a load balancer operatively connected to each of the at least first and 
second servers, the load balancer operative to route user requests to the one of the at least first 
and second servers with the least load; and 

(e) wherein the servers operate independently of the data storage devices and 
are connected to the data storage devices via the switch in a manner to permit the inclusion of an 
additional server to process at least an additional user request without the inclusion of an 
additional data storage device, to permit the inclusion of an additional data storage device 
without the inclusion of an additional server, to permit the removal of any one of the servers 
without the removal of any of the data storage devices, and to permit the removal of any one of 
the data storage devices without the removal of any of the servers. 

103. The system of Claim 102 wherein each of the first and second servers applies an 
application, the application applied by the first server being substantially the same as the 
application applied by the second server such that, in the event of a failure of either of the first 
and second servers, any subsequent user requests will be processed by any other of the servers 
that are operable, and wherein each of the plurality of data storage devices stores data, the data 
stored by each of the plurality of data storage devices being substantially the same such that, in 
the event of a failure of any one of the plurality of data storage devices, the data is accessible
from any other of the plurality of data storage devices that are operable.



104. A method for providing network processing and stored data access, the method 
comprising the steps of: 

(a) providing a server operative to apply an application;

(b) receiving a user request on the server; 

(c) applying the application to the user request to generate a query; 

(d) providing a data storage device configured to store data; 

(e) switching the query to the data storage device; 

(f) routing requested data from the data storage device to the server in 
response to the query; and 

(g) providing an additional server without providing an additional data storage 
device, or alternatively, providing an additional data storage device without providing an
additional server. 

105. The method of Claim 104 wherein the application is selected from the group 
consisting of: a mail application, a news application, a directory application, a content 
application, a groupware application, and an internet protocol (IP) service. 

106. A method for providing network processing and stored data access, the method 
comprising the steps of: 

(a) providing at least first and second servers operative to apply first and
second applications, respectively;

(b) receiving first and second user requests on the first and second servers,
respectively;



(c) applying the first and second applications to the first and second user 
requests, respectively, to generate first and second queries, respectively; 

(d) providing at least first and second data storage devices configured to store 
first and second data, respectively; 

(e) switching the first and second queries to the first and second data storage 
devices, respectively; 

(f) routing first requested data from the first data storage device to the first
server in response to the first query, and routing second requested data from the second
data storage device to the second server in response to the second query; and 

(g) providing an additional server without providing an additional data storage 
device, or alternatively, providing an additional data storage device without providing an 
additional server. 

107. The method of Claim 106 wherein each of the first and second applications is 
selected from the group consisting of: a mail application, a news application, a directory 
application, a content application, a groupware application, and an internet protocol (IP) service. 

108. The method of Claim 106 wherein the first application is substantially the same as 
the second application. 

109. The method of Claim 108 further comprising the step of: 

(h) in the event of a failure of either of the first and second servers, processing 
any subsequent user requests on any other of the servers that are operable. 



110. The method of Claim 106 wherein the first data is substantially the same as the
second data. 

111. The method of Claim 110 further comprising the step of:

(h) in the event of a failure of either of the first and second data storage 
devices, providing any subsequent requested data from any other of the data storage devices that
are operable. 

112. The method of Claim 106 wherein the first application is substantially the same as 
the second application, and wherein the first data is substantially the same as the second data. 

113. The method of Claim 112 further comprising the steps of:

(h) in the event of a failure of either of the first and second servers, processing 
subsequent requests on any other of the servers that are operable; and 

(i) in the event of a failure of either of the first and second data storage 
devices, providing any subsequent requested data from any other of the data storage devices that
are operable. 

114. A method for providing network processing and stored data access, the method 
comprising the steps of:

(a) providing at least first and second servers operative to apply first and
second applications, respectively, the first application being substantially the same as the second 
application; 



(b) receiving first and second user requests on the first and second servers,
respectively;

(c) applying the first and second applications to the first and second user 
requests, respectively, to generate first and second queries, respectively; 

(d) providing at least first and second data storage devices configured to store 
first and second data, respectively, the first data being substantially the same as the second data; 

(e) switching the first and second queries to the first and second data storage 
devices, respectively; 

(f) routing first requested data from the first data storage device to the first 
server in response to the first query, and routing second requested data from the second 
data storage device to the second server in response to the second query; 

(g) in the event of a failure of either of the first and second servers, processing 
any subsequent requests on any other of the servers that are operable; and 

(h) in the event of a failure of either of the first and second data storage 
devices, providing any subsequent requested data from any other of the data storage 
devices that are operable. 

115. The method of Claim 114 wherein each of the first and second applications is
selected from the group consisting of: a mail application, a news application, a directory 
application, a content application, a groupware application, and an internet protocol (IP) service. 

METHOD AND SYSTEM FOR PROVIDING
FAULT TOLERANCE TO A CONTINUOUS 
MEDIA SERVER SYSTEM 

CROSS-REFERENCE TO RELATED
APPLICATION

This application is a continuation of U.S. patent applica-
tion Ser. No. 08/437,935, filed May 9, 1995, now aban-
doned.

TECHNICAL FIELD 

The present invention relates generally to data processing 
systems and, more particularly, to fault tolerance in a con- 
tinuous media server system. 

BACKGROUND OF THE INVENTION

Some conventional data processing systems use a tech-
nique known as "mirroring" in order to continue operating
when a storage device fails. Mirroring refers to a technique
where for every storage device ("primary storage device") in
a data processing system, the data processing system main-
tains a mirror storage device. The mirror storage device is a
storage device that contains a duplicate copy of the data on
the primary storage device. Whenever an operation is per-
formed on the primary storage device that would alter the
data contained thereon (e.g., write), the same operation is
performed on the mirror storage device. Thus, at any given
time, the mirror storage device has an exact duplicate copy
of all the data on the primary storage device.

Since the mirror storage device has an exact duplicate
copy of all the data on the primary storage device, if the
primary storage device fails, the data processing system
switches to use the mirror storage device and the operation
of the data processing system continues with little interrup-
tion. Although mirroring provides for a more reliable data
processing system, the mirroring technique is not suitable
for all types of data processing systems since there must be
a duplicate of every storage device on the system and since
some interruption of the data processing system typically
occurs.
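
The conventional mirroring described above can be illustrated with a short Python sketch. This is not code from the patent: the class name `MirroredPair`, its methods, and the use of dictionaries to stand in for storage devices are assumptions made purely for illustration.

```python
class MirroredPair:
    """Hypothetical model of a primary storage device and its mirror."""

    def __init__(self):
        self.primary = {}            # stands in for the primary storage device
        self.mirror = {}             # stands in for the mirror storage device
        self.primary_failed = False

    def write(self, key, data):
        # Every operation that alters the primary is repeated on the mirror.
        if not self.primary_failed:
            self.primary[key] = data
        self.mirror[key] = data

    def fail_primary(self):
        # Simulate a failure of the primary storage device.
        self.primary_failed = True

    def read(self, key):
        # After a failure, the system switches to the mirror.
        source = self.mirror if self.primary_failed else self.primary
        return source[key]
```

The sketch makes the cost noted above visible: every device needs a full duplicate, and reads must be redirected at the moment of failure.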

One example of a data processing system where an 
interruption would not be acceptable, even for a short period 
of time, is a continuous media server system. A continuous 
media server system is a data processing system that typi- 
cally has many storage devices and delivers data at a 
constant rate to subscribers for the data. In this context, the 
phrase "constant rate" refers to delivering the appropriate 
amount of data to a subscriber over a period of time, such as 
a second. 

SUMMARY OF THE INVENTION

A method and system is provided for tolerating compo-
nent failure in a continuous media server system. The
present invention guarantees data streams at a constant rate
to subscribers for the data streams even when at least one
component fails. The present invention is able to guarantee
data streams at a constant rate by utilizing declustered
mirroring and by reserving bandwidth for both normal mode
processing and failure mode processing. The declustered
mirroring of the present invention is performed by dividing
the data to be stored in the continuous media server system
into blocks. The blocks are then striped across the storage
devices of the continuous media server system and each
block is divided into a number of sub-blocks. The sub-
blocks are in turn stored on separate storage devices. The
present invention reserves bandwidth for both normal mode
processing and failure mode processing. Since the present
invention utilizes declustered mirroring, the bandwidth
reserved for failure mode processing is reduced. Therefore,
when a failure occurs, the bandwidth reserved for failure
mode processing is utilized and the data streams to the
subscribers are uninterrupted.

In accordance with a first aspect of the present invention,
a system is provided for delivering data to consumers at a
constant rate. In accordance with this system of the first
aspect of the present invention, the system comprises a
plurality of sequentially numbered storage devices and a
send component. The plurality of sequentially numbered
storage devices contain data wherein the data comprises
blocks and sub-blocks and the data is striped across the
storage devices. A block is divided into a predefined number
of sub-blocks and sub-blocks for a block on a first storage
device are stored on the predefined number of storage
devices that numerically follow the first storage device. The
send component is for sending the blocks from the storage
devices to the consumers and when a storage device fails, for
sending the sub-blocks from the predefined number of
storage devices that numerically follow the storage device
that failed.

In accordance with a second aspect of the present
invention, a method is provided in a continuous media
system for delivering data to consumers at a constant rate.
The continuous media system has a plurality of numerically
sequential storage devices for storing data. The storage
devices have a primary portion and a secondary portion and
the data comprises numerically sequential blocks that are
striped across the storage devices. In accordance with this
method of the second aspect of the present invention, the
blocks are stored on the primary portion of the storage
devices such that after storing a block a next numerically
sequential block is stored on a next numerically sequential
storage device, the blocks are divided into a predefined
number of sub-blocks and for each block, the sub-blocks for
the block are stored on the secondary portion of the pre-
defined number of storage devices that numerically follow a
storage device on which the block is stored.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a video-on-demand system
of a preferred embodiment of the present invention. 

FIG. 2 is a more detailed block diagram of the cable 
station of FIG. 1. 

FIG. 3A is a partial plan view of a storage device of FIG.
2 of the preferred embodiment of the present invention.

FIG. 3B is a partial plan view of a storage device of FIG. 
2 of an alternative embodiment of the present invention. 

FIG. 3C is a diagram depicting an example of storing data
utilizing declustered mirroring on the storage devices of the
preferred embodiment of the present invention.

FIG. 3D is a diagram depicting an example of storing data
utilizing a first alternative embodiment of the present inven-
tion.

FIG. 3E is a diagram depicting an example of storing data 
utilizing a second alternative embodiment of the present 
invention. 

FIG. 4 is a diagram illustrating the scheduling of band-
width in a three disk drive system in accordance with the
preferred embodiment of the present invention.

FIG. 5 is a diagram illustrating an example of the sched-
uling of bandwidth in the three disk drive system of FIG. 4
when a disk drive fails in accordance with the preferred
embodiment of the present invention.

FIG. 6 depicts a high-level flow chart functionally illus-
trating the steps performed by the preferred embodiment of
the present invention.

FIG. 7 depicts a flow chart of the steps performed by the
preferred embodiment of the present invention for striping
data across the storage devices.

FIG. 8 depicts a flow chart of the steps performed by the
preferred embodiment of the present invention when trans-
mitting data in normal mode processing and failure mode
processing.

DETAILED DESCRIPTION OF THE
INVENTION

The preferred embodiment of the present invention pro-
vides a method and system for tolerating component failure
in a continuous media server system by utilizing declustered
mirroring and by reserving bandwidth for failure mode
processing. By utilizing the preferred embodiment of the
present invention, subscribers to the continuous media
server system are guaranteed a data stream at a constant rate
even when one or more components of the continuous media
server system fail. One example of a continuous media
server system is a video-on-demand system where subscrib-
ers request video image sequences, such as movies, and the
video-on-demand system guarantees a data stream of the
video image sequences to the subscribers at a constant rate.
In addition, the video-on-demand system can send a data
stream of audio data to subscribers. In a video-on-demand
system, it is important to guarantee data flow at a constant
rate. Otherwise, when a subscriber is viewing a movie and
a failure occurs, the movie will appear to have an interrup-
tion. Thus, the preferred embodiment can also be thought of
as preventing data flow interruptions.

In the video-on-demand system of the present invention,
the video-on-demand system has a number of storage
devices where the data for the video image sequences is
stored as blocks that are striped across the storage devices.
The term "striped" refers to storing blocks sequentially
across the storage devices and when the last storage device
is reached, wrapping around and storing the next block on
the first storage device. The data stream is sent to subscribers
by each disk sending a next sequential block of data to the
subscriber, one at a time. As previously stated, the preferred
embodiment of the present invention uses declustered mir-
roring in order to guarantee a data stream at a constant rate
to a subscriber. In this context, "mirroring" refers to storing
both a primary copy of a block of data and a secondary copy
of a block of data where each copy of the block of data is
stored on a separate storage device. The term "declustered"
refers to dividing the secondary block of data into a number
of sub-blocks where each sub-block is stored on a separate
storage device. By placing the sub-blocks across many
storage devices, when the storage device containing the
primary block fails, the burden of transmitting the secondary
block of data is shared among many storage devices, thereby
lessening the effect of failure mode processing on each
storage device. By using declustered mirroring, the preferred
embodiment of the present invention guarantees that one
component, either a storage device or a server of a storage
device, can fail and the data stream is unaffected. A "server"
of a storage device is responsible for managing the storage
device. As will be described in further detail below, the
preferred embodiment can tolerate more than one compo-
nent failure under certain circumstances.
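
The striping and declustered mirroring layout described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the zero-based device numbering, and the parameter `d` (the number of sub-blocks per block) are hypothetical, and the sketch assumes that the devices holding the sub-blocks wrap around past the last device in the same way the primary blocks do.

```python
from typing import List, Optional


def primary_device(block: int, num_devices: int) -> int:
    """Device holding the primary copy of a block (round-robin striping)."""
    return block % num_devices


def sub_block_devices(block: int, num_devices: int, d: int) -> List[int]:
    """Devices holding the d secondary sub-blocks of a block.

    Assumption: these are the d devices that numerically follow the
    primary device, wrapping around after the last device.
    """
    if not 0 < d < num_devices:
        raise ValueError("d must be between 1 and num_devices - 1")
    first = primary_device(block, num_devices)
    return [(first + k) % num_devices for k in range(1, d + 1)]


def read_plan(block: int, num_devices: int, d: int,
              failed: Optional[int] = None) -> List[int]:
    """Devices to read for a block: its primary device in normal mode,
    or the sub-block holders if the primary device has failed."""
    first = primary_device(block, num_devices)
    if failed is None or failed != first:
        return [first]
    return sub_block_devices(block, num_devices, d)
```

For example, with six devices and `d = 2`, block 5 lives on device 5; if device 5 fails, `read_plan(5, 6, 2, failed=5)` spreads the read across devices 0 and 1, so no single device carries the whole burden of the failed device.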
In addition to utilizing declustered mirroring, the pre-
ferred embodiment reserves bandwidth so as to be able to
guarantee data streams to subscribers at a constant rate. The
term "bandwidth" is intended to refer to the input/output
capacity (for a fixed time frame) of storage devices that hold
data for video image sequences. The video-on-demand sys-
tem of the present invention will be described below relative
to an implementation that concerns output bandwidth (i.e.,
reading data from storage devices holding video image
sequences), but those skilled in the art will appreciate that
the present invention may also be applied to input bandwidth
as well (i.e., writing video image sequence data to storage
devices). The preferred embodiment reserves bandwidth for
both normal mode processing and failure mode processing.
Normal mode processing refers to a mode of operation of the
video-on-demand system wherein no component failures
occur and failure mode processing refers to a mode of
operation of the video-on-demand system wherein a com-
ponent failure occurs. The preferred embodiment allocates a
time slot to each subscriber request for video image
sequences. This time slot is representative of a bandwidth
unit (i.e., a unit of system bandwidth) of the video-on-
demand system and is divided into two parts: a primary
period and a secondary period. The primary period may be
viewed as reserved bandwidth for normal mode processing
and the secondary period can be thought of as reserved
bandwidth for failure mode processing. The secondary
period is typically not used for sending data in normal mode
processing of the preferred embodiment. Instead, the sec-
ondary period is used in failure mode processing for sending
sub-blocks of data in order to compensate for the failure of
a component. Thus, by reserving bandwidth for failure mode
processing, the data stream to a subscriber is unaffected
when a failure occurs.
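
The division of a time slot into a primary period and a secondary period can be made concrete with a rough calculation. The sketch below is an illustration under stated assumptions, not a formula taken from the patent: it assumes that the secondary period must accommodate one sub-block (1/d of a block), that a device transfers data at a single fixed rate, and that seek time and rotational latency can be ignored. The function name and all numeric values are hypothetical.

```python
def time_slot(block_bytes: float, d: int, device_rate: float):
    """Return (primary_period, secondary_period, total) in seconds.

    block_bytes  -- size of one block
    d            -- number of sub-blocks per block (assumed)
    device_rate  -- device transfer rate in bytes per second (assumed constant)
    """
    if d < 1:
        raise ValueError("d must be at least 1")
    primary_period = block_bytes / device_rate          # time to send one full block
    secondary_period = (block_bytes / d) / device_rate  # time to send one sub-block
    return primary_period, secondary_period, primary_period + secondary_period
```

Under these assumptions, a 1,000,000-byte block on a 5,000,000 byte-per-second device needs a 0.4 second slot when `d = 1` (a full mirror copy in the secondary period) but only about 0.25 seconds when `d = 4`, which is consistent with the statement that declustering reduces the bandwidth reserved for failure mode processing.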

Further, the preferred embodiment of the present inven-
tion reduces the size of overall time slot necessary for
normal mode and failure mode processing. That is, the
preferred embodiment has a technique ("storage device
segmentation") for reducing the amount of time that must be
reserved for both normal mode processing and failure mode
processing, thereby increasing the total bandwidth of the
system. The preferred embodiment reduces the overall time
slot by dividing the storage devices into a primary portion
and a secondary portion. The primary portion of the storage
device contains the primary blocks of data and the secondary
portion of the storage device contains the sub-blocks of data.
The preferred embodiment designates the primary portion as
the faster region (typically the outer region) of the storage
device and designates the secondary portion as the slower
region (typically the inner region) of the storage device.
Thus, the preferred embodiment takes advantage of the
increased data transfer rates on the faster regions of a storage
device. That is, by using storage device segmentation, the
majority of data transferred during a time slot is retrieved
from the outer region of the storage device that has a faster
data transfer rate than the inner region of the storage device.
This technique exploits the fact that storage devices, such as
hard disks, typically have a platter with many concentric
tracks. The outermost tracks are larger than the inner tracks
and thus can store more data. In addition, the platter spins at
a constant rate. Thus, in one revolution of the platter, the
outermost tracks can transfer more data than the inner tracks.
Therefore, the outermost tracks have a faster data transfer
rate than the inner tracks. Although storage device segmen-
tation has been described relative to a hard disk, one skilled
in the art will appreciate that storage device segmentation
can be used with any device having a faster region and a
slower region.
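
The reasoning about outer and inner tracks can be checked with a small calculation. The following sketch is illustrative only; the function name and the track capacities, sector size, and spin rate are hypothetical values chosen for the example and do not come from the patent.

```python
def transfer_rate(sectors_per_track: int, bytes_per_sector: int, rpm: float) -> float:
    """Bytes per second read from one track at a constant spin rate."""
    revolutions_per_second = rpm / 60.0
    # One revolution passes the whole track under the head once.
    return sectors_per_track * bytes_per_sector * revolutions_per_second


# Hypothetical disk: outer tracks hold more sectors than inner tracks.
outer = transfer_rate(sectors_per_track=100, bytes_per_sector=512, rpm=5400)
inner = transfer_rate(sectors_per_track=60, bytes_per_sector=512, rpm=5400)
```

Because the spin rate is the same for both tracks, the ratio of the two rates equals the ratio of the track capacities, which is why placing the primary blocks on the outer region shortens the primary period of each time slot.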
In summary, the preferred embodiment of the present
invention guarantees data streams to subscribers at a constant
rate. In order to do this, the preferred embodiment stores data
using declustered mirroring and reserves bandwidth up front
for both normal mode processing and failure mode processing.
Further, the preferred embodiment uses storage device
segmentation to reduce the amount of bandwidth that must be
reserved for normal mode processing and failure mode
processing. By reducing the amount of bandwidth that must be
reserved, the video-on-demand system is more efficient and can
service more subscribers. Although the preferred embodiment of
the present invention is described below with reference to a
video-on-demand system, one skilled in the art will appreciate
that the present invention can be used with any continuous
media server system or, more generally, any system wherein a
data stream must be delivered at a constant rate.

In describing the preferred embodiment of the present
invention, the description is presented in three parts. First, a
description of the hardware components is presented. Second, a
description of the data structures used by the preferred
embodiment is presented. Third, the step-by-step processing of
the preferred embodiment is presented with accompanying
flowcharts to illustrate the interrelationships between the
hardware components and the data structures as well as to
illustrate overall processing of the preferred embodiment of
the present invention.

With respect to the hardware components, the preferred
embodiment of the present invention is adapted for use in a
video-on-demand server system like that shown in FIG. 1. The
system depicted in FIG. 1 is a video-on-demand server system in
which subscribers may request at any point in time to view
particular video image sequences transmitted from the cable
station 10. The cable station 10 transmits the data for the
video image sequences over the interconnection network 12 to
the subscribers 14. The interconnection network 12 may be any
suitable interconnection mechanism, including an asynchronous
transfer mode (ATM) network. Functionally, the interconnection
network 12 acts like a crosspoint, banyan or other switch
topology. The cable station 10 preferably makes available a
large number of different video image sequences that may be
transmitted to the subscribers 14 and viewed in real time. The
data for the video image sequences may contain video data,
audio data and other types of data, such as closed captioning
data. The present invention may also be applied solely to audio
data or other types of data sequences.

For such a video-on-demand server system, the choice of video
image sequence viewed by a subscriber is not pre-scheduled.
Viewing choices are scheduled upon subscriber demand. A
subscriber need not choose a video image sequence that other
subscribers are watching; rather, the subscriber may choose
from any of the available video image sequences. Furthermore,
each subscriber chooses when he wishes to start viewing a video
image sequence. A number of different subscribers 14 may be
concurrently viewing different portions of the same video image
sequence. A subscriber may select where in a sequence he
desires to start viewing and can stop watching a sequence
before the entire sequence has been viewed.

FIG. 2 is a block diagram showing the cable station 10 in more
detail. The cable station 10 is a video-on-demand server. The
cable station 10 includes a controller 16 that is responsible
for scheduling transmission of video image sequences to
subscribers 14 (FIG. 1). The controller 16 controls several
subsystems 18A, 18B, and 18C and is responsible for scheduling
and directing output from the subsystems to subscribers 14. The
controller may be duplicated to provide a backup controller
that enhances the fault tolerance of the system. In addition,
one skilled in the art will appreciate that the functioning of
the controller can be distributed across the subsystems,
thereby eliminating the need for a controller. Although only
three subsystems are shown in FIG. 2, those skilled in the art
will appreciate that, in most instances, it is more suitable to
employ a larger number of subsystems. Only three subsystems are
shown in FIG. 2 for purposes of simplicity and clarity.

Each subsystem 18A, 18B, and 18C includes a microprocessor 20A,
20B, and 20C that is responsible for controlling respective
pairs of storage devices (22A, 24A), (22B, 24B) and (22C, 24C).
The data for the video image sequences that are available to
the subscribers 14 are stored on the storage devices 22A, 24A,
22B, 24B, 22C and 24C. Each subsystem 18A, 18B, and 18C need
not include two storage devices, rather each subsystem may
include only one storage device or may, alternatively, include
more than two storage devices. The microprocessors 20A, 20B,
and 20C are responsible for cooperating with the controller 16
to transmit the data for the video image sequences stored on
the storage devices to the subscribers 14.

Storage devices 22A, 22B, 22C, 24A, 24B and 24C may be, for
instance, magnetic disk drives or optical disk drives. Those
skilled in the art will appreciate that any suitable storage
device may be used for storing the data for the video image
sequences. For instance, RAM, masked ROM, EPROM and flash
EPROMs may be used to store the video image sequences in the
present invention.

FIG. 3A depicts a portion of storage device 22A of FIG. 2 in
more detail. Storage device 22A is described with reference to
being a disk storage device. Although storage device 22A is
depicted, the other storage devices 24A, 22B, 24B, 22C, and 24C
are similar. Storage device 22A has an outer region 302 and an
inner region 304. The outer region 302 is also known as the
primary portion and the inner region 304 is also known as the
secondary portion. As previously described, the data transfer
rates for the outer region 302 far exceed those of the inner
region 304. Also, as previously described, the video image
sequences are divided into sequential blocks of data that are
striped across the primary portions of the storage devices.
Block size is variable, but typically a block includes 64
kilobytes to 4 megabytes of data. Block size is bounded by an
upper limit that may not be exceeded. Striping the blocks of
data refers to storing a first block of data on a first storage
device and each sequentially following block of data is stored
on the next sequential storage device. When reaching the last
storage device, the preferred embodiment wraps around and
stores the next block of data on the first storage device. This
striping continues until all the blocks of data are stored
across the storage devices. By storing the blocks on the
primary portion of a storage device, it guarantees faster data
transfer rates for the majority of the data that a storage
device transfers.

After storing the primary blocks of data on the primary
portions of the storage devices, the preferred embodiment of
the present invention then stores data onto the secondary
portions of the storage devices by utilizing declustered
mirroring. Although the preferred embodiment is described as
storing data on the secondary portions of the storage devices
after storing data on the primary portions of the storage
devices, one skilled in the art will appreciate that data can
be stored on the primary portions after the secondary portions
or data can be stored on the primary portions and the secondary
portions simultaneously. The data on the
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secondary portion is used during failure mode processing. ait win appreciate that additional video image sequences can 

For each block of dau on the primary portion of a storage be stored in this manner by the present invention, 

device, the block of data is divided into ''D" sub-Mocks, By utilizing dcclustcrcd mirroring as shown in HQ. 3C, 

where *T)" refers to a dcclustcring numbct That is, Ihc the video-on-demand system of the present invention can 

declustaing nuniber is the number of storage devices across 5 tolerate a storage device failure and continue operating in a 

which the sub-blodcs are stored. As the dechistering nuinber seamless manner (i.e., without interruption). However, the 

is increased, the number of storage devices that are used for pr-rferred embodiment of the present invention can also 

transmitting the sub-blocks of data during failure mode tolerate the failure of a subsystem in a seamless manner 

jrocessing is inacascd, which lessens Ac burden of per- without intcnnpting the data stream to the subscriber. Wth 

fooning failure mode processing by each storage device, jq respect to transferring data, the preferred embodiment just 

However, the greater the declustcting number, the greater treats die storage devices of the failed subsystem as having 

the ratio of network and other system overhead to tfie failed. 

amount of data transferred. Althou^ the preferred embodi- In order to tolerate the failure of a subsystem, the pre- 
mcnt of the |]resent invention uses a declustering numba of f erred embodiment of the present invention assigns numbers 
8, one skilled in the art will appreciate that other decluster- 15 to the storage devices in a particular manner. Tht numbers 
ing numbers can be used by the present invention. Anotha arc assigned by first sequentially numbering each subsystem 
benefit associated with using a higher declustering number from 1 to N. The number assigned to a subsystem can be 
is that as the declustering number is increased, the amount expressed by the variable **i." Each storage device for a 
of bandwidth that is reserved for failure mode processing is subsystem i is then assigned a number as follows: i, irt-i, 
reduced (Le., the secondary poiod of the time slot), 2n+i ... until all storage devices are numbered. For exan^jle. 
Therefore, since the primary period of the time slot transfers when numbering the storage devices of FIG. 2, storage 
data at a faster rate than the secondary period of the time device 24Amay be considered the first storage device, with 
slot, a greater declustering number means less data is being storage device 24B being the second, storage device 24C 
transf axed from the slower part of the storage device and being the third, storage device 22A being the fourth, storage 
thus a fasttt overall data transfer rate is realized. In turn, the 25 device 228 being the fifth, and storage device 22C being the 
faster the data transfer rate, the smaller the amount of srtth. Therefore, if a declustering number of 2 is used with 
bandwidth that must be reserved by the system and the more the system depicted in FIG. 2 and with the storage devices 
subscribers can be serviced by the system. being numbered as previously described, the blocks on 
FIG. 3B depicts a more derailed diagram of storage device storage device 24A are divided into sub-blocks that are 
22A of FIG. 2 in an alternative embodiment of the present 30 stored on storage device 24B and storage device 24C. 
invention. Storage device 22A is described with reference to Further, the blocks stored on storage device 22A are stored 
being a disk storage device. In the alternative embodiment, as sub-blocks on storage device 22B and storage device 22C. 
the innermost region of the storage device 22A is an unused Therefore, if subsystem 18A were to fail, the storage devices 
region 305, The unused region 305 has tfie slowest data of subsystems 18B and 18C are able to transmit the data that 
transfer rate of the storage device 22A and is thus unused so 35 would have been transmitted by the storage devices of 
as to inaease the data transfer rate of the overall storage subsystem 18A and, therefore, Ac data stream to the sub- 
device. The outer region 302 and the inner region 307 are scribers is not intermpted. 

accordingly smaller in size. Although the preferred embodiment of the present inven- 

FIG. 3C dqncts an exan^lc of dcclustercd mirrcring of tion has been described as tolerating the failure of one 

the preferred emtxxllmcnt of the present invention. FtG. 3C 40 storage device or one subsystem, one skilled in the art will 

depicts three storage devices 306, 308, 310 with each appreciate that as the number of subsystems increases and 

storage device having a primary portion 312, 316, 320 and the number of storage devices increases, a dcclustcring 

a secondary portion 314, 318, 322. In this exaiii5)le, the video number can be chosen so that more than one storage device 

image sequences are comprised of three blocks of data, or subsystem can fail witiiout interrupting the data stream to 

block A, block B, and block C which are stored on the 45 the subscribers. That is, in the preferred embodiment of the 

primary portions 312, 316, 320 of the storage devices, presentinvention,iffailedstOTage devices or subsystems are 

icspectively. In this exanQ>lc, the declustering number is 2 spread out with no less than "D" storage devices between the 

and, therefore, block A is divided into two sub-blocks with failed components, no interruption of data streams to sub- 

thc first sub-block Al being stored on the secondary portion scribers occurs. 

318 of storage device 308 and the second sub-block A2 50 The declustered mirroring of the present invention has 

being stored on the secondary portion 322 of storage device altemative embodiments of which two arc described below. 

310. Block B is divided into two sub-blocks, Bl and B2, The first aUcmative embodiment spreads the burden of 

which are stcaed on the secondary portions of storage perfoming failure mode processing across more storage 

devices 310, 306, respectively. Also, block C is divided into devices than the preferred embodiment, thereby lessening 

two sub-blocks, CI and C2, which are stored on the sec- 55 the effect of failure mode processing on any one storage 

ondary portions of storage devices 306, 308, respectively. device. The first alternative embodiment sequentially num- 

Therefore, by striping the data on the primary portions of the bers each subsystem from 1 to N. Then, for each subsystem 

storage devices and storing the sub-blocks on the secondary 1 to N, each storage device is sequentially numbered. For 

portions of the storage devices, if a failure occurs to storage example, if subsystem 1 had three storage devices, these 

device 308, storage device 310 and storage device 306 can €0 storage devices would be numbered 1, 2 and 3, respectively, 

each send sub-blocks Bl and B2 so that the data stream to The second subsystem would then number its storage 

the sut>sciibcr is not interrupted. Although a video image devices starting with the number 4 and so on until all of the 

sequence has been described as comprising three blocks of storage devices for each of the subsystems are numbered, 

data, one skilled in the art will ^jpredate that a video inoage After numbering all of the storage devices in this manner, 

sequence can comprise many blocks of data. In addition, 65 the blocks for the lowest numbered storage device for a 

although only one video image sequence has been described subsystem arc stored on the secondary portion of the lowest 

as being striped across the storage devices, one skilled in the numbered storage device for the "D" subsystems that follow 
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tbc sabsystenL The blocks on the primary partion of the next In this figure, there are six storage devices, SDl, SD2, SD3, 

sequentiaUy numbered storage device on the subsystem are SD4, SD5, SD6 that are divided into two clusters, cluster 1 

spKt into D sub-blocks and arc then stored on the next to and duster 2. The declustcring number used in this example 

lowest numbered storage device of the D+1 through 2D is 2. As can be seen from the figure, the block stared on the 

subsystems that follow the subsystem. Therefore, the blocks 5 primary portion of a storage device within a duster is 

on the lowest numbered storage device would be stored divided into sub-hlodcs that are stored on the secondary 

across the D following subsystems on the lowest numbered portions of the other storage devices within the cluster. For 

storage device of each subsystem and the blocks on the next example, block A is stored on the primary portion of SDl 

sequential storage device would be stored across the Df 1 and is divided into two sub-blocks, Al and A2, which are 

through 2D following subsystems on the next to lowest jq stored on the secondary portions of SD2 and SD3, respec- 

numbcrcd storage device of each subsystcnL This process is lively. Similarly, block B stored on the primary p«tion of 

continued until all blocks on each storage device are stored SD2, is divided into sub-blocks Bl and B2, which are stored 

as subblocks. on the secondary portions of SD3 and SDl, respectively. 

The first alternative embodiment is perhaps best described Furthermore, block C, stored on the primary portion of SD3, 

by way of an cxan^)lc, which is provided in FIG. 3D. In FIG. 55 is divided into sub-blocks CI and C2, whidi arc stored on 

3D, there are five subsystems, subsystem 1, subsystem 2, the secondary portions of SDl and SD2, respectively, 

subsystem 3, subsystem 4 and subsystem 5 with each Although two alternative embodiments have been 

subsystem having two storage devices, SDl, SD2, SD3, described, one skilled in the art will apfredate that other 

SD4, SDS, SD6, SD7, SD8, SD9 and SDIO. Eadi storage numbcrings or groupings of the storage devices can be used 

device has a primary portion for storing blocks and a 20 by the present invention. Further, one skilled in the art will 

secondary portion for storing sub-blocks. In this example, a appreciate that both the blocks and sub-blocks can be stored 

dedustcring number of two is used. As can be seen in FIG. in a different manner by the present invention. 

3D, block A, stared on the primary portion of SDl, is With respect to the data structures used by the preferred 

divided into two sub-blocks, Al and A2, which are stored on embodiment, scheduhng for each storage device is done on 

the secondary portions of SD3 and SD5, respectively. Block 25 a colrnnn of time slots. Each column includes a number of 

B, stared on SD2, is divided into two sub-blocks, Bl and B2, time slots in a sequence &at repeats. Each time slot is a 

whidi are stored on the secondary portions of storage bounded period of time that is sufiSdent for the storage 

devices SD8 and SDIO, respectively. By utilizing the first device to output a block of data. One time slot from each 

alternative embodiment, the burden of performing failure column of time slots togetha comprise a bandwidth unit. A 

mode processing is divided amongst many disks. For 30 bandwidth unit is a unit of allocation of bandwidth of the 

example, when subsystem 1 fails, the load for perforrcing video-on-demand system of the present invention and is 

failure mode processing is equally distributed over sub- used to transfer data. Each time slot in the bandwidth unit is 

s>'stems 2, 3, 4 and 5, and storage devices SD3» SD5, SD8 assodated with a different storage device that ouq)uts a 

and SDIO. block of data of a video image sequence. Since the blocks of 

The second alternative embodiment of dcclustered mir- 35 data arc striped across the storage device, consecutive blocks 

roring of the present invention reduces the vulnerability of of data are read fr-om the predetermined sequence of storage 

the system to the failure of two or more conq)oncnts. As devices during the sequence of time slots of the bandwidth 

previously stated, the prefeaxed embodiment can tolerate a unit The time slots are generated by the controller 16 or 

second continent failure if the second corrq)onent is not other suitable mechanism (FIG. 2). 

within D storage devices of the component that failed. That 40 The notions <rf a colunm of time slots and a bandwidth 

is, the system caimot tolerate a component failure within the unit can perhaps best be explained by way of example, 

following **D** storage devices or the preceding **D" storage Subscribers are scheduled by bandwidth unit In other 

devices from the component that failed. The prefeued words, they are granted the same numbered time slot in each 

embodiment cannot tolerate the failure of a conoponent colunm. FIG. 4 shows the scheduling of seven subscribers 

within D following storage devices since the blocks for a 45 for three storage devices (e.g., diskl,disk2anddisk3).The 

storage device are stored on the D following storage devices. rectangles (e.g., 400) shown in FIG. 4 are time slots. Each 

The preferred embodiment cannot tolaate the failure of a time slot has a primary period (e.g., 402) and a secondary 

second component within D preceding storage devices since period (e.g., 404). The primary period of the time slots is for 

a storage device stores sub-blocks for the D preceding sending data from the primary portion of the storage device 

storage devices. Therefore, the preferred embodiment is 50 andthesecondaryportionof the time slot is for sending data 

vulnerable to the failure of 2D storage devices. from the secondary portion of the storage device. The 

The second alternative embodiment of declustered mir- numbers 1-7 in FIG. 4 correspond to the time slot in the 

roring reduces the vulnerability of the system to the failure respective columns 1, 2 and 3. Time slots of a common 

of two or more components by dividing the storage devices bandwidth unit all have the same number. Columns 1, 2 and 

into groups of clusters. A "duster'* is a group of storage 55 3 are aU offset temporally relative (ie., a time unit in FIG. 

devices containing D+1 storage devices. For each blodc on 4) to cadi other, but each colunm has the same sequence of 

a storage device in a cluster, the block is divided into D time slots. As can be seen in FIG. 4, disk drive 1 services 

sub-blocks and is stored on the other storage devices within eadi of the subscribers in scqueooe bcgirming with the 

the cluster. As such, by utilizing the second alternative subscriber who has been allocated logical unit of bandwidth 

embodiment, if a storage device fails within a cluster, the 60 1. In the example of FIG. 4, bandwidtti unit 1 includes the 

system can continue operating without interruption even if a time slots labded 1 in columns 1, 2 and 3. During the slot 

second storage device fails, as long as the second storage 1 of colunm 1, disk drive 1 begins outputdng a block of data 

device is not within the cluster of the failed storage device. for a video image sequence to a first subscriber that has been 

Therefore, the second alternative embodiment is vulnerable assigned bandwidth unit 1. One time unit later, disk drive 2 

to the failure of D+1 storage devices and, as such, increases 65 outputs the next block of data to the first subscriber during 

the tolerance of the system for multiple failures. An example time slot 1 of column 2. Further, at time unit 2, disk drive 3 

of the second alternative embodiment is depicted in FIG. 3E. outputs the next block of data for the video image sequence 
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to the subsoibex during tijnc slot 1 of column 3. The step, the prcferrcd embodiment determines the storage 
predefined sequence of storage devices in this example is device on which the initial block to be viewed in the video 
disk drive 1, disk drive 2 and disk drive 3, with the sequence image sequence is stored for a particular subscnbet If the 
wrapping back around to disk drive 1 from disk drive 3. subscriber is viewing the video image sequence from the 
FIG. 5 d^icts the columns of time slots of FIG. 4 after a 5 beginning of the sequence, the initial block is the first block 
failure of disk 2 has been detected. After a failure is detected in the sequence. However, where the subscriber desires to 
by the preferred embodiment of the present invention, the view the video image sequence beginning at some interme- 
Uocksthatwouldnorroallybesentby the storage device that diate point, the initial block is the first block that the 
failed are sent as sub-blodcs by the **D'* following disks. For subscriber desires to view. Once the storage device that 
cxanq)le,FIG. 5 depicts an example of disk 2 failing with a 10 holds the initial block of the requested video image sequence 
declustoing number of 2. That is, all of the blocks contained to be viewed has been identified, the prefened embodiment 
on the primary portion of disk 2 are stored as sub-blocks on of the present invention selects a bandwidth unit that may be 
the secondary portions of disk 1 and disk 3. During the used to transmit the video data of the requested video image 
primary period of the time slots for both disk 1 and disk 3, sequence to the requesting subscriber. The preferred 
processing is performed as normal That is, for example, embodiment of the present invention selects the next band- 
disk 1 in the primary period of the first time slot sends a width unit that is available (i.e., unallocated to a subscriber), 
block destined for subscriber L However, after detecting a The scheduling of subscriber requests and, more generally, 
failure, the secondary periods of the time slots for both disk the video-on-demand system of the present invention is 

1 and disk 3 are used for sending the sub-blocks that when more clearly described in VS. patent application Ser. No. 
combined comprise the block ("aggregate block") that 20 08/159,188, entitled '"Method and System for Scheduling 
should have been sent by the failed disk. For cxanq>le, disk the Transfer of Data Sequences," which is hereby incccpo- 

2 rfiiring the first time slot was to send a block destined fcx- rated by reference. Alternatively, the preferred embodinoent 
subscriber 6. Since disk 2 has failed, during the secondary of the present invention may sdiedule subscriber requests as 
period of the first time slot of disk 1, disk 1 sends tiie first described in U.S. patent application Ser. No. 08/349,889, 
sub-block of the block destined for subscriber 6 (e.g.. 6.0). 25 entitled "Method and System for Scheduling the Transfer of 
In adxlition, during the secondary period of the first time slot Data Sequences Utilizing an Antidustering Scheduling 
of disk 3, disk 3 sends the second sub-block that is to be sent Algcrithm,"* which is hereby incorporated by reference, 
to subscriber 6 (c^g., 6.1). Therefore, using this method. After scheduling subscriber requests, the preferred cmbodi- 
subscriber 6 receives the block of data as an aggregate block ment of the present invention traiLSXQits blocks of data in 
without an intexniption in the data stream. In other words, 30 sequence to the subscribers (step 608). In this stq>, the 
the f^ftfa stream to all the subscribers scheduled for disk 2 prefened embodiment accesses the columns of time slots 
will be uninterrupted when a failure occurs of disk 2. To the and transmits the blocks of data to the subscribers. In 
subscriber, no interruption in service is noticed and therefore addition, if a component fails, the preferred embodiment 
the subscriber is unaware that a failure has occurred. switdies to failure mode and continues transmitting blocks 

Wth respect to the step-by-step processing performed by 35 of data in sequence to the subscribers without the subscrib- 

die prefored embodiment, FIG. 6 depicts a flowchart func- ers noticing a disruption in the data stream. With regard to 

tionally illustrating the steps performed by the preferred a particular video image sequence, the blocks of data are 

embodiment of the jresent invention. The preferred cmbodi- transmitted until either the end of the video image sequence 

ment of the present invention is responsible for assigning or until the subscriber requests the video image sequence to 

numbers to the storage devices, storing dau on the storage 40 stop. This step is described in more detail below. Although 

devices, receiving subscriber requests, scheduling the sub- the steps of FIG. 6 have been described with a specific wdcr, 

scriber requests, and transmitting blocks of data in sequence one skilled in the art will appreciate that two or more of the 

to the sutTscribers during both normal mode processing and steps may be performed concuaently or in a different order, 

faihire mode processing. The first stq> performed by the For example, while the preferred embodiment is transmit- 

prefcrred embodiment of the present invention is to assign a 45 ting blocks of data in sequence, the preferred embodiment 

number to each storage device (step 6#1). In this step, the can receive more subscriber requests and schedule those 

prefened embodiment assigns a number to each storage subscriber requests. 

device as previously described where a sequential number is FIG. 7 depicts a flowchart of the steps performed by the 

assigned to one storage device of each subsystem. After prefcrredembodimentof the present invention when striping 

assigning a sequential number to one storage device of each so data for a video image across the storage devices. Although 

subsystem, the preferred embodiment wraps around and the striping of data is described for only one video image, 

then assigns a sequential number to a second storage device one skiUed in the art will appreciate that tht present inven- 

of each subsystem. This process continues until all storage tion can stripe many video images across the storage 

devices are assigned a number. After assigning a number to devices. The first step performed by the prcfored embodi- 

eacfa storage device, the preferred embodiment stripes the 5s ment when striping data is to select the next storage device, 

data across the storage devices (step 602). In this step, the starting with an arbitrary storage device (step 702). In ttiis 

prefcrredembodimentof the present invention stripes all the step, the preferred embodiment selects the next storage 

blocks for one or more video images across the primary device for storing a block of data or upon the first invocation 

portions of the storage devices. In addition, the prefened of this step, the piefeired emtwdiment selects an arbitrary 

embodiment divides each block into '*D" sub-blocks and 60 storage device. When the preferred embodiment selects the 

stores the sub-blocks for a particular block on the **D" next storage device, if the last storage device is encountered, 

numoicaUy following storage devices. This step will be the preferred embodiment wraps around and selects the first 

described in greater detail below. After striping the data storage device. Alternatively, instead of selecting an axbi- 

across the storage devices, the preferred embodiment trary storage device, one skilled in the art will appreciate that 

receives subscriber requests (step 604), 65 the storage device that is least fiill may be initially selected 

After receiving subscriber requests, the preferred embodi- by the present invention. After selecting the next storage 

ments schedules the subscriber requests (step 606). In this device, the prefened embodiment selects the next block of 



data, starting with the first (step 704). That is, the preferred
embodiment selects the next block of data from the video
image to be stored or the first block of data if this step is
being invoked for the first time. After selecting the next
block of data, the preferred embodiment stores the selected
block on the primary portion of the selected storage device
(step 706). After storing the block of data on the primary
portion of the storage device, the preferred embodiment
divides the block into "D" sub-blocks and stores the
sub-blocks on the secondary portion of the next sequential "D"
storage devices (step 708). In this step, after dividing the
block into sub-blocks, the sub-block corresponding to the
first part of the block is stored on the next sequential storage
device and each subsequent sub-block is stored on a
sequentially following storage device. After dividing and storing
the sub-blocks on the secondary portions of the storage
devices, the preferred embodiment determines whether there
are more blocks in the video image to be stored (step 710).
If there are more blocks to be stored, the preferred
embodiment continues to step 702 wherein the preferred
embodiment selects the next sequential storage device. However, if
all of the blocks have been stored, processing ends. 
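
The FIG. 7 steps can be sketched in code. The following Python is a minimal illustration under stated assumptions, not the implementation of the preferred embodiment: the function name, the list-based model of primary and secondary portions, the equal-size split of a block into sub-blocks, and the requirement that D be smaller than the number of devices are all hypothetical choices made for the example.

```python
def stripe_with_declustered_mirroring(blocks, num_devices, d, start=0):
    """Stripe blocks over primary portions and mirror each as d sub-blocks.

    Returns (primary, secondary). primary[i] lists (block_index, data) on
    device i; secondary[i] lists (block_index, sub_index, data) on device i.
    """
    # Assumption: d < num_devices so no sub-block lands on its own device.
    if not 1 <= d < num_devices:
        raise ValueError("d must satisfy 1 <= d < num_devices")
    primary = [[] for _ in range(num_devices)]
    secondary = [[] for _ in range(num_devices)]
    device = start  # step 702: start with an arbitrary device
    for block_index, block in enumerate(blocks):  # step 704: next block
        primary[device].append((block_index, block))  # step 706
        # Step 708: split into d sub-blocks (equal-size split is assumed)
        size = -(-len(block) // d)  # ceiling division
        for k in range(d):
            sub = block[k * size:(k + 1) * size]
            target = (device + 1 + k) % num_devices  # k-th following device
            secondary[target].append((block_index, k, sub))
        device = (device + 1) % num_devices  # step 702 again, wrapping
    return primary, secondary
```

With three devices, three blocks and D = 2, the sketch reproduces the layout described for FIG. 3C: the sub-blocks of the first block land on the second and third devices, those of the second block on the third and first devices, and those of the third block on the first and second devices.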

FIG. 8 depicts a flowchart of the steps performed by the 
preferred embodiment of the present invention when trans- 
mitting blocks in sequence to subscribers. Steps 804-806 
reflect the normal mode processing performed by the pre- 
ferred embodiment of the present invention. Steps 808-818 
describe the failure mode processing performed by the 
preferred embodiment of the present invention. The first step 
performed by the preferred embodiment is to determine if a 
component has failed (step 802). In this step, the system 
detects whether a subsystem has failed or a storage device 
has failed. The system detects the failure of a subsystem by 
using a "deadman protocol." In utilizing the deadman
protocol, each subsystem sends a ping (i.e., a message) after 
a predetermined amount of time to the sequentially preced- 
ing subsystem and listens to the subsystem that sequentially 
follows the subsystem. If a subsystem has not received a 
ping within a predetermined period of time from the
sequentially following subsystem, a time-out occurs. Upon the
time-out occurring, the subsystem signals the controller and 
the controller sends a ping to the sequentially following 
subsystem. If the sequentially following subsystem does not 
respond to the ping from the controller, the controller 
determines that the sequentially following subsystem has 
failed. The detection of a storage device failure occurs when 
a subsystem detects that one of its storage devices is no 
longer sending data. After detecting that the storage device 
is no longer sending data, the subsystem sends a message to 
the controller indicating the failure of the storage device. If
the preferred embodiment does not detect a component
failure, processing continues to step 804 and the preferred
embodiment performs normal processing. 
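
The subsystem half of step 802 can be illustrated with a small simulation. This is a sketch only: the function name, the use of timestamps for the last ping received from each subsystem, and the `responds` callback standing in for the controller's own ping are assumptions, and the sketch assumes every listening subsystem is itself alive.

```python
def deadman_check(last_ping_time, now, timeout, responds):
    """Return the subsystems the controller would declare failed.

    last_ping_time[j] is when subsystem j's latest ping reached its
    sequentially preceding subsystem. responds(j) models the controller's
    follow-up ping to subsystem j and returns True if j answers.
    """
    n = len(last_ping_time)
    failed = []
    for listener in range(n):
        follower = (listener + 1) % n  # subsystem the listener listens to
        timed_out = now - last_ping_time[follower] > timeout
        # On a time-out the listener signals the controller, which pings
        # the follower itself before concluding that it has failed.
        if timed_out and not responds(follower):
            failed.append(follower)
    return failed
```

The controller's confirmation ping is what separates a late ping from a dead subsystem: a follower that times out but still answers the controller is not reported.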

In performing normal processing, the preferred
embodiment receives data from the subsystems (step 804). In this
step, the preferred embodiment accesses the column of time
slots and processes subscriber requests for the primary
period of each time slot. In effect, each storage device
marches down its column of time slots and processes each 
subscriber request. In processing subscriber requests, the 
storage devices send the appropriate block of data for a 
particular subscriber. After receiving the data from the 
storage devices, the preferred embodiment sends the data to
the subscribers (step 806). In this step, the system deter- 
mines the subscriber for each block of data received and 
sends the blocks to the appropriate subscriber via the
interconnection network.
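
How each storage device marches down its column can be sketched as a lookup from time to bandwidth unit. The function and its parameters are hypothetical: `slot_len` and `offset` (both in time units) are assumptions, since the text says only that the columns are offset from one another by a time unit and share the same repeating sequence of time slots. Devices are indexed from 0 and bandwidth units from 1.

```python
def bandwidth_unit_at(disk, t, num_slots, slot_len=1, offset=1):
    """Bandwidth unit whose time slot `disk` is processing at time t.

    Models the steady state: every column repeats the same sequence of
    num_slots time slots, and column k starts k * offset time units
    after column 0.
    """
    local = t - disk * offset  # time measured from this column's start
    return (local // slot_len) % num_slots + 1
```

With the defaults, the first device begins bandwidth unit 1 at time 0, the second at time 1 and the third at time 2, matching the order in which the text has consecutive disk drives serve the first subscriber.
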
If the preferred embodiment detects a component failure,
the preferred embodiment realigns the deadman protocol if
the component failure detected is a subsystem failure (step
808). When the preferred embodiment realigns the deadman
protocol, it indicates to the immediately following
subsystem to send the ping to the immediately preceding
subsystem of the failed subsystem. In addition, the
immediately preceding subsystem listens for the ping of the
immediately following subsystem. After realigning the
deadman protocol, the preferred embodiment adjusts the
column of time slots for each "D" following storage device
(step 810). This is done by inserting entries into the
secondary period of each time slot for each "D" following
storage device. The entry in each secondary period
corresponds to the entry in the primary period of the same time
unit for the storage device that failed. For example, if in time
slot one, the failed storage device were to send a particular
block to subscriber 6, the "D" following storage devices
send the corresponding sub-block of subscriber 6 during the
secondary period of the time slot that they are currently
processing when the failed storage device would have been
processing the first time slot. In this step, when a subsystem
fails, the storage devices for the failed subsystem are treated
as having failed.
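
Step 810 can be sketched as filling in secondary-period entries for the D following devices. The representation is an assumption made for illustration: each column is modeled as a list indexed by time unit within one repeating cycle, `primary[device][t]` names the subscriber served in the primary period, and the k-th following device is taken to hold sub-block k, following the step 708 description. Only a single failed device is handled.

```python
def adjust_time_slots(primary, failed, d):
    """Build secondary-period entries after device `failed` fails.

    Returns secondary, where secondary[device][t] is None or a
    (subscriber, sub_index) pair to send in the secondary period at t.
    """
    n = len(primary)
    cycle = len(primary[failed])
    secondary = [[None] * cycle for _ in range(n)]
    for t, subscriber in enumerate(primary[failed]):
        if subscriber is None:
            continue  # unallocated slot: nothing to cover
        for k in range(d):
            target = (failed + 1 + k) % n  # k-th following device
            # Same time unit as the failed device's primary entry
            secondary[target][t] = (subscriber, k)
    return secondary
```

Primary periods are left untouched, so the surviving devices keep serving their own subscribers while the secondary periods carry the failed device's load.
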
After adjusting the time slots, the preferred embodiment
receives data from the subsystems (step 812). In this step,
the preferred embodiment receives both blocks of data from
the primary portion of the storage devices as well as sub-
blocks from the secondary portion of the storage devices.
After receiving data from the subsystems, the preferred
embodiment sends the data to the subscribers (step 816). The
processing of this step is similar to that as described relative
to step 806 above, except that upon receiving sub-blocks, the
subscribers combine the sub-blocks into aggregate blocks.
After sending data to the subscribers, the system determines
if there is an additional component failure (step 818). The
processing of this step is similar to that as described relative
to step 802 above. If an additional component failure is
detected, processing continues to step 808. However, if an
additional component failure is not detected, processing
continues to step 812 and the preferred embodiment con-
tinues to operate in failure mode. It should be noted that the
preferred embodiment operates in failure mode until a
system administrator can replace the component that has
failed. However, until that time, the video-on-demand sys-
tem of the present invention continues to deliver data
streams to subscribers without the subscribers noticing any
interruption in the data streams. Therefore, the preferred
embodiment of the present invention sends data to subscrib-
ers at a constant rate and can thus guarantee the constant rate
in the face of at least one component failure.
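
As a rough illustration of how sub-blocks relate to the aggregate block a subscriber rebuilds, the sketch below splits a block into D contiguous slices and joins them again. The slicing scheme and the function names are assumptions made for illustration; the text says only that the sub-blocks are combined into aggregate blocks.

```python
def split_block(block: bytes, declustering: int) -> list[bytes]:
    """Split a block into `declustering` contiguous sub-blocks.

    Sub-blocks are equal-sized except possibly the last (an assumption).
    """
    if declustering < 2:
        raise ValueError("declustering number must be greater than one")
    size = -(-len(block) // declustering)  # ceiling division
    return [block[i * size:(i + 1) * size] for i in range(declustering)]


def combine_sub_blocks(sub_blocks: list[bytes]) -> bytes:
    """Rebuild the aggregate block from sub-blocks received in order."""
    return b"".join(sub_blocks)


if __name__ == "__main__":
    original = bytes(range(10))
    parts = split_block(original, 3)
    assert combine_sub_blocks(parts) == original
```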

While the present invention has been described with
reference to a preferred embodiment thereof, those skilled in
the art will appreciate that various changes in form and detail
may be made without departing from the spirit and scope of
the present invention as defined in the appended claims. For
instance, other storage media may be used and different
quantities of storage media may be used. In addition, dif-
ferent declustering numbers may be used and the ordering of
the storage devices may differ.

We claim:

1. A continuous media server system having a consumer
for consuming data at a given amount per time interval, the
continuous media server system for delivering data to the
consumer at the given amount per the time interval, com-
prising:

a plurality of storage devices containing data, wherein the
data comprises blocks and sub-blocks, wherein a block
is divided into a clustering number of sub-blocks,
wherein the clustering number is a number greater than
one, and wherein sub-blocks for a block on a first
storage device are stored on the clustering number of
storage devices that follow the first storage device; and
a send component for sending a sequence of the data to
the consumer at the given amount per the time interval,
wherein the sequence comprises the blocks and when a
failure occurs such that a block cannot be sent to the
consumer, the sequence comprises the sub-blocks for
the block from the clustering number of storage devices
that follow the storage device that stores the block to
ensure that the sequence of data to the consumer is
uninterrupted.

2. The continuous media server system of claim 1 wherein
the storage devices are sequential and the sub-blocks are
striped across the storage devices.

3. The continuous media server system of claim 1 wherein
the storage devices comprise a faster region and a slower
region such that the blocks of data are stored on the faster
region of the storage device and the sub-blocks of data are
stored on the slower region of the storage devices.

4. The continuous media server system of claim 1 wherein
the storage devices comprise a fast region, a medium speed
region and an unused region such that the blocks of data are
stored on the fast region of the storage device, the sub-
blocks of data are stored on the medium speed region of the
storage device and the unused region comprises a portion of
the storage device that has a slower data transfer rate than
the fast region and the medium speed region.

5. The continuous media server system of claim 1, further
including a reserver component for reserving bandwidth for
sending both sub-blocks and blocks to the consumers,
wherein bandwidth is output capacity of the system, and
wherein the send component sends the blocks from the
storage devices to the consumers utilizing the reserved
bandwidth and when a storage device fails, the send com-
ponent sends the sub-blocks utilizing the reserved band-
width.

6. The continuous media server system of claim 1, further
comprising a plurality of subsystems for managing the
storage devices and including a means for sending the
sub-blocks from the clustering number of storage devices
that follow a second storage device when a subsystem that
manages the second storage device fails.

7. The continuous media server system of claim 1, further
comprising an ordering of subsystems for managing the
storage devices and a numbering component for performing
a sequence of assigning sequential numbers to the storage
devices wherein one storage device is assigned a sequential
number from each ordered subsystem and for repeating the
sequence until all storage devices are assigned a sequential
number.

8. The continuous media server system of claim 1 wherein
the system is a video-on-demand system and wherein the
data is video image sequences.

9. The continuous media server system of claim 1 wherein
the sequence of the data is a stream of the data and wherein
the send component sends the stream of the data to the
consumer at a constant rate over a period of time.

10. In a video-on-demand system having a consumer for
consuming data at a given amount per time interval, the
video-on-demand system for delivering data to the consumer
at the given amount per the time interval, the video-on-
demand system having a plurality of sequential storage
devices for storing data, wherein the data comprises video
image sequences having sequential blocks, a method com-
prising the steps of:
under the control of the video-on-demand system,
storing the blocks on the storage devices such that after
storing a block a next sequential block is stored on a
next sequential storage device;
dividing the blocks into a clustering number of sub-
blocks, wherein the clustering number is a number
greater than one; and
for each block,
storing sub-blocks for the block on the clustering
number of storage devices that sequentially follow
a storage device on which the block is stored.

11. The method of claim 10 wherein the storage devices
comprise a faster region and a slower region, wherein the
step of storing the blocks includes the step of storing the
blocks on the faster region of the storage devices such that
after storing a block, a next sequential block is stored on a
next sequential storage device, and wherein the step of
storing sub-blocks includes the step of storing sub-blocks for
the block on the slower region of the clustering number of
storage devices that sequentially follow the storage device
on which the block is stored.

12. The method of claim 11, further including the step of
providing an unused region to the storage devices that is a
portion of the storage device having a slowest data transfer
rate.

13. The method of claim 1[illegible] wherein the storage devices
are managed by sequential subsystems and wherein the
method further includes the steps of:

performing a sequence of assigning sequential numbers to
the storage devices wherein one storage device is
assigned a sequential number from each sequential
subsystem; and
repeating the sequence until all storage devices are
assigned a sequential number.

14. In an on-demand media server system having a
consumer for consuming data at a given amount per time
interval, a plurality of components and a controller, the
components comprising a sequence of storage devices for
storing blocks of data and sub-blocks of data and a plurality
of subsystems for managing the storage devices, the con-
troller for managing the subsystems, wherein the storage
devices comprise a primary portion for storing the blocks
and a secondary portion for storing the sub-blocks, wherein
the blocks are sequential and each block is divided into a
clustering number of sub-blocks, wherein the clustering
number is a number greater than one, a method for guaran-
teeing data delivery to the consumer at the given amount per
the time interval, comprising the steps of:

under the control of the controller of the on-demand
media server system,
receiving blocks from the primary portion of the stor-
age devices;
sending the received blocks to the consumers;
determining when a component has failed; and
when it is determined that a component has failed,
receiving sub-blocks from the secondary portion of
the clustering number of storage devices that
sequentially follow the component that failed;
combining the received sub-blocks to create an
aggregate block; and
sending the aggregate block to the consumers.

15. The method of claim 14 wherein the storage devices
comprise a faster region and a slower region, wherein the
primary portion of the storage devices is located on the faster
region and the secondary portion of the storage devices is
located on the slower region, wherein the step of receiving
blocks further includes the step of receiving blocks from the
primary portion on the faster region of the storage devices,
and wherein the step of receiving sub-blocks further
includes the step of receiving sub-blocks from the secondary
portion on the slower region of the clustering number of
storage devices that sequentially follow the component that
failed.

16. In a continuous media server system having a con-
sumer for consuming data at a given amount per time
interval, and a plurality of sequential storage devices for
storing sequential blocks of data and sub-blocks of data, a
method for guaranteeing data delivery to the consumer at the
given amount per the time interval, comprising the steps of:

under the control of the continuous media server system,
striping the blocks sequentially across the storage
devices;
for each block,
dividing the block into a clustering number of sub-
blocks, wherein the clustering number is a number
greater than one;
storing the sub-blocks on the clustering number of
storage devices that sequentially follow a storage
device containing the block;
providing the storage devices with a time slot for
sending data, wherein the time slot has a primary
period and a secondary period;
during the primary period of the time slot,
sending blocks from storage devices to consum-
ers; determining whether a storage device has
failed; and when it is determined that a storage
device has failed,
sending sub-blocks from the clustering number of
storage devices that sequentially follow the
storage device that failed during the secondary
period of the time slot.

17. The method of claim 16 wherein the storage devices
comprise a faster region and a slower region, wherein the
blocks are stored on the faster region and the sub-blocks are
stored on the slower region, wherein the step of sending
blocks includes the step of sending blocks from the faster
region of the storage devices to consumers and wherein the
step of sending sub-blocks includes the step of sending
sub-blocks from the slower region of the clustering number
of storage devices that sequentially follow the storage device
that failed.

18. The method of claim 16 wherein the storage devices
are managed by sequential subsystems and wherein the
method further includes the steps of:

performing a sequence of assigning sequential numbers to
the storage devices wherein one storage device is
assigned a sequential number from each sequential
subsystem; and

repeating the sequence until all storage devices are
assigned a sequential number.

19. In a data processing system having a consumer for
consuming data at a given amount per time interval, a
method for guaranteeing data delivery to the consumer at the
given amount per the time interval, comprising the steps of:

providing a continuous media server system to the data
processing system for guaranteeing data delivery to the
consumer at the given amount per the time interval, the
continuous media server system comprising a plurality
of sequential storage devices for storing data and a
plurality of sequential servers for managing the storage
devices, wherein the data comprises sequential blocks;

storing the blocks on the storage devices by the continu-
ous media server system such that after storing a block
a next sequential block is stored on a next sequential
storage device;
dividing the blocks into a clustering number of sub-blocks
by the continuous media server system, wherein the
clustering number is a number greater than one; and
storing sub-blocks for a block on a storage device that is
managed by a server on a storage device of the clus-
tering number of servers that follow the server by the
continuous media server system.

20. The method of claim 19, wherein the step of storing
sub-blocks includes the step of storing sub-blocks for a
second block on a second storage device that is managed by
the server on a storage device of the second clustering
number of servers that follow the first clustering number of
servers.

21. In a continuous media server system having a con-
sumer for consuming data at a given amount per time
interval, and a plurality of sequential storage devices for
storing data that are grouped into clusters of storage devices,
wherein the data comprises sequential blocks, a method for
guaranteeing data delivery to the consumer at the given
amount per the time interval, comprising the steps of:
under the control of the continuous media server system,
storing the blocks on the storage devices such that after
storing a block a next sequential block is stored on a
next sequential storage device;
dividing the blocks into a clustering number of sub-
blocks, wherein the clustering number is a number
greater than one; and
storing sub-blocks for a block on a storage device
within a cluster on a clustering number of storage
devices within the cluster.

22. In a video-on-demand system having a consumer for
consuming data at a constant rate over a period of time, and
a plurality of sequential storage devices for storing data, the
data comprising video image sequences having sequential
blocks, a method for guaranteeing a stream of data to the
consumer at the constant rate, comprising the steps of:
under the control of the video-on-demand system,
storing the blocks on the storage devices such that after
storing a block a next sequential block is stored on a
next sequential storage device;
dividing the blocks into a clustering number of sub-
blocks, wherein the clustering number is a number
greater than one;
storing the sub-blocks for each block on the clustering
number of storage devices that follow the storage
device on which the block is stored;
receiving a request for a stream of the data from the
consumer;
determining whether a storage device has failed;
when it is determined that a storage device has failed,
for each block,
if the block is not located on the storage device
that failed,
sending the block to the consumer, and if the
block is located on the storage device that
failed,
sending the sub-blocks for the block to the con-
sumer from the clustering number of storage
devices that follow the storage device that
failed to ensure that the stream of data is
uninterrupted due to the storage device failure;
and
when it is determined that a storage device has not
failed,
for each block,
sending the block to the consumer.

23. A computer-readable media whose contents cause a
continuous media server system to become fault tolerant, the
continuous media server system having a consumer for
consuming data at a constant rate over a period of time and
a plurality of sequential storage devices for storing data, the
data comprising blocks, the continuous media server
system for sending data to the consumer at the constant rate
over the period of time, by performing the steps of:
storing the blocks on the storage devices of the continuous
media server system such that after storing a block, a
next sequential block is stored on a next sequential
storage device;
dividing the blocks into a clustering number of sub-
blocks, the clustering number is a number greater than
one;
storing the sub-blocks for each block on the storage
devices of the continuous media server system that
sequentially follow the storage device on which the
block is stored;
receiving a request for the data; and
for each block,
determining if the block is located on a storage device
that failed;
if it is determined that the block is located on a storage
device that has not failed,
sending the block to the consumer, and
if it is determined that the block is located on a storage
device that has failed,
sending the sub-blocks for the block to the consumer
to ensure that the constant rate at which the data is
sent to the consumer does not change due to the
failure of the storage device.

* * * * *

[57] ABSTRACT
A multiple processor (CPU) computer system, each 
CPU having a separate, local, random access memory 
means to which it has direct access. An interprocessor 
bus couples the CPUs to memories of all the CPUs, so 
that each CPU can access both its own local memory 
means and the local memories of the other CPUs. A run 
queue data structure holds a separate run queue for each 
of the CPUs. Whenever a new process is created, one of 
the CPUs is assigned as its home site and the new pro- 
cess is installed in the local memory for the home site. 
When a specified process needs to be transferred from 
its home site to another CPU, typically for performing 
a task which cannot be performed on the home site, the 
system executes a cross processor call, which performs 
the steps of: (a) placing the specified process on the run
queue of the other CPU; (b) continuing the execution of
the specified process on the other CPU, using the local
memory for the specified process's home site as the
resident memory for the process and using the interpro-
cessor bus to couple the other CPU to the home site's
local memory, until a predefined set of tasks has been
completed; and then (c) placing the specified process on 
the run queue of the specified process's home site, so 
that execution of the process will resume on the pro- 
cess's home site. 

13 Claims, 6 Drawing Sheets 
PROCESS DISTRIBUTION AND SHARING
SYSTEM FOR MULTIPLE PROCESSOR
COMPUTER SYSTEM

This is a continuation of application Ser. No. 907,568
filed Sept. 13, 1986, now abandoned.

The present invention relates generally to multiple
processor computer systems, and particularly to appara-
tus and methods for moving a process from one site to
another within a multiple processor computer system.

BACKGROUND OF THE INVENTION 

The prior art includes a large number of different
multiple processor computer systems, and a number of
variations on the UNIX (a trademark of AT&T) operat-
ing system.

For the purposes of this introduction, multiple pro-
cessor computer systems can be generally classified into
two distinct types: (1) those that perform complex cal-
culations by allocating portions of the calculation to
different processors; and (2) those that are enhanced
multitasking systems in which numerous processes are
performed simultaneously, or virtually simultaneously,
with each process being assigned to and performed on
an assigned processor. The present invention concerns
the second type of multiple processor system.

In order to avoid confusion between the terms "pro-
cessor" (which is a piece of apparatus including a cen-
tral processing unit) and "process" (which is a task
being performed by a computer), the terms "site" and
"CPU" shall be used herein synonymously with "pro-
cessor". For instance, when a process is created, it is
assigned to a particular site (i.e., processor) for execu-
tion.

As background, it should be understood that in any
multitasking system, there is a "run queue" which is a
list of all the processes which are waiting to run. In most
systems the run queue is a linked list. When the system
is done with one process (at least temporarily) and
ready to start running another process, the system looks
through the run queue and selects the process with the
highest priority. This process is removed from the run
queue and is run by the system until some event causes
the system to stop the selected process and to start
running another process.
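
The run queue selection just described can be sketched as follows. This is a minimal illustration, assuming a plain Python list in place of a linked list and assuming that a larger number means a higher priority; the `Process` class and `pick_next` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    priority: int  # assumption: larger value means higher priority


def pick_next(run_queue: list[Process]) -> Process:
    """Look through the run queue, then remove and return the
    highest-priority waiting process."""
    if not run_queue:
        raise IndexError("run queue is empty")
    best = max(run_queue, key=lambda p: p.priority)
    run_queue.remove(best)
    return best


if __name__ == "__main__":
    queue = [Process("editor", 1), Process("disk_io", 5), Process("mail", 3)]
    print(pick_next(queue).name)  # the selected process leaves the queue
```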

In prior art multiple processor (also called multiple
CPU) systems, there is generally a single large memory
and a single run queue for all of the processors in the
system. While the use of a single run queue is not inher-
ently bad, the use of a single large memory tends to
cause increasing memory bus contention as the number
of CPUs in the system is increased.

Another problem associated with most multiple CPU
computer systems is that only one of the CPUs can perform
certain tasks and functions, such as disk access. There-
fore if a process needs to perform a particular function,
but is running at a site which cannot perform that func-
tion, the computer system needs to provide a method
for that process to perform the function at an appropri-
ate site within the system.

Generally, the problems associated with such "cross
processor calls" include (1) minimizing the amount of
information which is moved or copied from one site to
another each time a process makes a cross processor
call; (2) devising a method of updating the system's run
queue(s) which prevents two processors from simulta-
neously changing the same run queue, because this
could produce unreliable results; and (3) providing a
method for efficiently transferring a process to an-
other site and, usually, then automatically transferring
the process back to its original site.
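
Problem (2) above, preventing two processors from changing the same run queue at once, can be illustrated with a small sketch. It uses Python threads and a `threading.Lock` purely as a stand-in for whatever mutual exclusion mechanism a real system would use; the `RunQueue` class is hypothetical and is not taken from the text.

```python
import threading
from collections import deque


class RunQueue:
    """A run queue whose updates are serialized by a lock (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: deque = deque()

    def put(self, process) -> None:
        # Only one thread at a time may modify the queue.
        with self._lock:
            self._items.append(process)

    def take(self):
        # Returns None when nothing is waiting to run.
        with self._lock:
            return self._items.popleft() if self._items else None

    def __len__(self) -> int:
        with self._lock:
            return len(self._items)


if __name__ == "__main__":
    rq = RunQueue()
    workers = [threading.Thread(target=lambda: [rq.put(i) for i in range(250)])
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(len(rq))
```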

The present invention solves the primary memory 
contention and cross processor call problems associated 
with prior art multiple CPU systems by providing a 
separate local memory and a separate run queue for 
each processor. Memory contention is minimized be- 
cause most processes are run using local memory. When 
a process needs to be transferred to a specified proces-
sor, a cross processor routine simply puts the process on
the run queue of the specified CPU. The resident mem-
ory for the process remains in the local memory for the
process's home CPU, and the specified CPU continues 
execution of the process using the memory in the home 
CPU. The process is transferred back to its home CPU 
as soon as the tasks it needed to perform on the specified 
CPU are completed. 

It is therefore a primary object of the present inven- 
tion to provide an improved multiple CPU computer 
system. 

Another object of the present invention is to provide 
an efficient system for transferring processes from one 
CPU to another in a multiple CPU computer system. 

SUMMARY OF THE INVENTION 
In summary, the present invention is a multiple pro- 
cessor (CPU) computer system, each CPU having a 
separate, local, random access memory means to which 
it has direct access. An interprocessor bus couples the 
CPUs to memories of all the CPUs, so that each CPU 
can access both its own local memory means and the 
memory means of the other CPUs. A run queue data 
structure holds a separate run queue for each of the 
CPUs. 

Whenever a new process is created, one of the CPUs 
is assigned as its home site and the new process is in- 
stalled in the local memory means for the home site. 
When a specified process needs to be transferred from 
its home site to another CPU, typically for performing 
a task which cannot be performed on the home site, the 
system executes a cross processor call, which performs 
the steps of: (a) placing the specified process on the run 
queue of the other CPU; (b) continuing the execution of 
the specified process on the other CPU, using the mem-
ory means for the specified process's home site as the
resident memory for the process and using the interpro-
cessor bus means to couple the other CPU to the home
site memory means, until a predefined set of tasks has
been completed; and then (c) placing the specified pro- 
cess on the run queue of the specified process's home 
site, so that execution of the process will resume on the 
process's home site. 
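
Steps (a) through (c) of the cross processor call can be sketched in a few lines. This is a simplified model under stated assumptions: each CPU's run queue is a `deque`, the process's memory is a dictionary that is passed by reference rather than copied (standing in for access over the interprocessor bus), and scheduling order on the other CPU is ignored. The names `Process` and `cross_processor_call` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    home_site: str
    memory: dict = field(default_factory=dict)  # resident memory stays at the home site


def cross_processor_call(run_queues: dict, proc: Process, other_site: str, task) -> list[str]:
    """Run `task` for `proc` on `other_site`, then return it to its home site."""
    trace = []
    # (a) place the process on the run queue of the other CPU
    run_queues[other_site].append(proc)
    trace.append(f"queued on {other_site}")
    # (b) the other CPU picks the process up and continues execution,
    #     working directly on the home site's memory (no copy is made)
    run_queues[other_site].remove(proc)
    task(proc.memory)
    trace.append(f"ran on {other_site}")
    # (c) place the process back on the run queue of its home site
    run_queues[proc.home_site].append(proc)
    trace.append(f"requeued on {proc.home_site}")
    return trace


if __name__ == "__main__":
    queues = {"MP": deque(), "AP1": deque()}
    p1 = Process("P1", home_site="AP1")
    print(cross_processor_call(queues, p1, "MP", lambda mem: mem.update(disk_read="done")))
```

Only a reference to the process moves between run queues in this sketch, which mirrors the point that the resident memory is never relocated.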

BRIEF DESCRIPTION OF THE DRAWINGS 

Additional objects and features of the invention will 
be more readily apparent from the following detailed 
description and appended claims when taken in con- 
junction with the drawings, in which: 

FIG. 1 is a block diagram of a multiprocessor com- 
puter system, and some of its most important data struc- 
tures, in accordance with the present invention. 

FIGS. 2A and 2B schematically represent the cross
processor call procedure of the preferred embodiment. 

FIG. 3 is a block diagram of the CPUSTATE data 
structure used in the preferred embodiment of the pres- 
ent invention. 
FIG. 4 is a flow chart of the process by which a
[illegible] computer system.

FIG. 6 is a flow chart of the [illegible]
which is used by the context switching method dia-
gramed in FIG. 4.

FIG. 7 is a flow chart of the method [illegible]
a new process and assigning it a home site.

FIG. 8 is a block diagram of a virtual memory man-
agement page table used in the preferred embodiment of
the present invention.

DESCRIPTION OF THE PREFERRED
EMBODIMENT

Referring to FIG. 1, there is shown a block diagram
of a multiprocessor computer system 20, and some of its
most important data structures, in accordance with the
present invention. In a typical configuration, the system
20 includes a main processor MP, and a plurality n of
application processors AP1 to APn.

All of the system's processors are homogeneous,
separate one-board microcomputers using a Motorola
68020 central processing unit. For convenience, these
processors are also called "CPUs" and "sites".

The system 20 is a multitasking computer system
which typically has a multiplicity of processes, also
called "user processes" or "tasks", running and waiting
to run. As will be described in greater detail below,
each user process is assigned a "home site" or "home
processor" when it is created.

Cross Processor Calls

The present invention provides a simple system for
temporarily moving a user process away from its home
site to a specified remote site. To do this, the state of the
process is saved (just as it would be saved whenever the
process is suspended in any other multitasking system),
and then a pointer to the user process is simply added to
the remote site's "run queue" and removed from the list
of processes running on the home site.

When the remote site picks up this user process from
its run queue, it merely reinstates the process and runs
the process just like any other process. After the task
which required the cross processor call is completed,
the interprocessor transfer is reversed by saving the
user process's state and adding a pointer to the user
process to the run queue for its home site.

The system's memory is organized so that the remote
site can use the user process's memory at its home site,
rather than moving the process's resident memory to
the remote site.

There are several advantages to this approach. The
first is that the context which requires a user process to
be moved away from its home site need not be copied
into a message encapsulating the request, and all refer-
ences to parameters needed by the process can be made
directly to the process's resident memory at the home
site. Secondly, there is no need to synchronize one
processor with any other. Third, the cross processor
call is virtually transparent to the system and requires
very little overhead and minimal modification of the
system's operating system. In most instances, the only
modification to the operating system required is the
insertion of a cross processor call in system routines
[illegible] preferred embodiment.

Referring to FIG. 2A, the process identified as Proc
[illegible] on an applications processor (i.e., not the
main processor) and to be running on its home site. The
process runs until a system routine is called (box 50).
If the system routine can be run locally [illegible]
continues to run on the home site.

If, however, the system routine cannot be run locally
(box 52), then the process is put on the main processor's
run queue (box 56), where it waits until the main proces-
sor [illegible] execution (box 58). Then the process
resumes running on the MP, where it runs the system
routine.

When the system routine completes the tasks which
[illegible] performed by the MP, the process is put back
[illegible] queue (box 62), where it waits [illegible]
(box 64) and continue execution of the process (box 50).

Looking at this same process from another perspec-
tive, in FIG. 2B at time zero process P1 is running on
processor AP and process P3 is running on the main
processor MP. At time t1 process P1 performs a system
call (i.e., calls a system routine) which requires pro-
cessing on the MP. Therefore, shortly after the system
[illegible] P1's context is saved (in its user struc-
ture) and P1 is put on MP's run queue. It should be
noted that P1 is usually given a high priority when it
performs a cross processor call so that its request will be
serviced quickly. Also, when the AP stops running P1 it
picks up another process P2 from its run queue and runs
that process.

The MP continues to run process P3 until some event
blocks or interrupts P3's execution at time t3, at which
point the MP will run P1. At time t4, when P1 finishes
the system tasks which required the use of the main
processor MP, P1's context is saved and it is put back on
the run queue for its home site, AP. At this point the
MP picks up the highest priority process from its run
queue, which may be the process P3 that was inter-
rupted by P1.

The AP continues to run process P2 until some event
blocks or interrupts P2's execution at time t5, at which
point the AP will pick up the highest priority process
from its run queue, which may be the process P1.

Memory

Referring again to FIG. 1, each CPU has its own
memory module MEM-MP, and MEM1 to MEMn,
which is used as the primary random access memory for
its corresponding CPU.

The resident memory for each user process in the
system is located in the memory module for its home
site.

While each CPU has its own memory module, these
memory modules are multiported and connected to at
least one bus so that all physical memory in the system
can be accessed by any processor in the system. That is,
all of the system's physical memory resides in a single,
large physical address space. Also, any given processor
can use the page tables describing the virtual memory of
[illegible] memory organization, any user
process in the system can execute on any of the system's
processors, even though the process's resident
memory is located on only one processor (the home
site).

Since access to local memory (i.e., access by a CPU
to its own memory module) is much faster than cross-
processor memory access, the system is designed to run
a process as much as possible on its home processor, and
to move its execution to another processor only when
necessary. Normally, a process's execution is moved
away from its home site only to have a system request
serviced which cannot be serviced at the home site.

The main processor's memory module MEM-MP
holds a number of important data structures including a
Process Table 30 which holds essential data regarding
each of the processes in the system, a CPUSTATE data
structure which contains important data on the status of
the processor, and a set of USER data structures which
hold data relating to the state of each process allocated
to the main processor.

Each processor has a CPUSTATE data structure,
which is stored in its local memory. Each processor's
local memory also has an array of USER data struc-
tures, which are used to hold data relating to the pro-
cesses allocated to that processor. The details and uses
of these data structures are explained below.

Operating System

All of the CPUs use a modified UNIX operating
system. (UNIX is a trademark of AT&T Bell Laborato-
ries.) Full UNIX functionality is available in all the
processors. To accomplish this, the UNIX kernel has

Interprocessor Busses

Another difference between the main processor MP
and the applications processors AP1 to APn is that the
main processor has different interprocessor bus connec-
tions than the other processors.

The system has two interprocessor busses 24
and 26. One, called the system composition bus 24, can
transfer data between processors at a rate of 12 mega-
bytes per second (12 MB/sec). The other bus, called the
interprocessor bus 26, can transfer data between proces-
sors at [illegible] MB/sec. The provision of two such busses
allows the faster bus to handle video tasks and other
tasks which require high data rates, while the slower bus
handles interprocessor memory requests between the
main processor and one of the applications processors.

All of the applications processors AP1 to APn and
the system's Display Processor 28 are connected to both
busses 24 and 26. The main processor MP is not con-
nected to the faster bus mostly to avoid the cost of
adding another high speed port to the main processor,
which is already burdened with the disk controller and
two terminal ports (not shown in the Figures).

Indivisible Read-Modify-Write Instructions

As will be explained in more detail below, the proces-
sors [illegible] perform at least one of
the "indivisible read-modify-write instructions" known
[illegible].
Generally, an "indivisible read-modify-write instruc-
tion" is an instruction which involves two steps, a test
step and then a conditional data writing step, which
cannot be interrupted until it is complete.

For instance, the TAS instruction can be used to test
whether a specified memory location is equal to
zero, and, if so, to set it equal to one. Making this in-
struction "indivisible" ensures that while processor AP1
been modified by dividing or replicating portions of the performs a TAS on memory location X, no other pro- 
kernel among the MP and attached APs, and adding cesser can modify X. Otherwise, another processor, 
new interprocessor communicauon facilities. The inter- 40 such as MP or AP2 could modify X after API had tested 
processor communication facilities allow requests X's value but before API was able to set X to 1. 
which cannot be handled locally (e.g., on one of the An indivisible compare and swap (CAS) instruction 
APs) to be handled by another processor (e.g., the main works similarly to the TAS instruction, except that the 
processor MP). ^^^P compares a first CPU register with a memory 

For those not skilled in the art, it should be known 45 location, and the second step stores the value in a sec- 

that tiie term "kernel" is generally used to refer to a set ond CPU register into the memory location if the com- 

of software routines, including system routines which parisonjs test criterion is satisfied, 

handte most of the system's hardware dependent func In addition to the run queue lock, indivisible read- 

tions-such as disk Less and other input and output modify-wnte instructions are used for updatmg all data 

^ „vjTY w«ai Hmio^iiiv nnniprf from 50 structurcs that could othcrwise be snnultaneously ac- 

operations The UNIX k^el is typically cop ed from processor. In 

disk mto the system's other words, each such data structure must have a cor- 

main processor's memory MEM_MP) whenever the ^^^p^^^^g y^^^ indivisible read modify 

system is restarted. ^ ^ , write instruction must be used to test the lock flag be- 

The main difference between the mam processor MP ^^^^ corresponding data structure is updated. Data 

and the applications processors API to APn is that the structures, such as the USER structure, which cannot 

main processor MP U the only processor that can per- simultaneously accessed by more than one processor 

form certain system functions. For instance, the main ^^^^ ^^^^ ^ock protection. As will be understood 

processor is the only one which can access the system's ^^^^ skilled in the art, an example of another data 
disk storage units 22. 60 structure which requires the use of a lock flag is the 

The primary goal for the allocation of system func- sleep queue for each processor, 

tions between the processors is to make the operation of p x 

each processor as autonomous as the system's hardware Process Table 

configuration will allow. In other words, each AP is in the main processor's memory module there is a 
allowed to perform as many system functions locally as 65 data structure called the process table 30. This table has 

is consistent with the system's hardware. This mini- one row 38 of data for each user process in the system, 

mizes the frequency of interprocessor function calls and including the following data which are used in the pres- 

interprocessor memory access. eni invention. 
There is a priority parameter 31 which indicates the relative
priority of the process. In the preferred embodiment numerically low
priority parameter values are used to indicate high process
priority. User processes are assigned priority values between 127
(the lowest priority) and 25 (the highest priority), while system
tasks are assigned priority values between 24 and zero.

Each process in the system is either actively running, is waiting to
run, or is temporarily inactive. The processes waiting to run on
each processor are placed on separate run queues. Similarly, there
is a set of sleep queues for temporarily inactive processes, and
actively running processes are not on any queue as long as they are
running.

Each run queue is simply a linked list formed using the queue link
parameter 31 in the process table 30. For each CPU there is a
CPUSTATE data structure 40 which points to the row 38 of the process
table 30 for the first process in its run queue. The queue link 31
for that process points to the row of the process table for the next
process in the processor's run queue, and so on. The entry for the
last process in each run queue has a zero in its queue link to
indicate that it is at the end of the queue.

For each process, the process table 30 also includes a Home CPU
parameter 33 which identifies the assigned home CPU for the process,
and a Current CPU parameter 34 which identifies the current CPU on
which it is running or waiting to run. The Next CPU parameter 35 is
used in cross processor calls to indicate the CPU on which the
process is to be run.

Finally, for each process there is a User Struc parameter 36 which
points to the User Structure for the process. A separate User
Structure, which is a (3072 byte) buffer, is assigned to every
process for storing the state of the process, certain special
parameters, and for holding a stack called the system mode stack
that is used when the process performs certain system mode
functions.

While the process table 30 contains other parameters used by the
UNIX operating system, only those used for cross processor calls are
described herein.

CPUSTATE Data Structure

FIG. 3 is a block diagram of the CPUSTATE data structure used in the
preferred embodiment of the present invention. There is one CPUSTATE
data structure for each processor in the system.

The CPUSTATE data structure for each processor contains the
following parameters. A Run Queue Header 40a points to the row of
the process table 30 for the first process in the processor's run
queue. A somewhat simpler way to state this is to say that the Run
Queue Header 40a points to the first process in the run queue for
the corresponding processor.

The Current Proc parameter 40b points to the row of process table 30
for the process currently running in the processor. The Last Proc
parameter 40c points to the row of process table 30 for the last
process which ran in the processor before the current process began
running.

The Current Priority parameter 40d is the priority value for the
process currently running in the processor.

The Run Lock parameter 40e is a flag which is normally equal to
zero. Any processor which wants to update or modify the processor's
run queue is required to check this flag using a TAS instruction
before proceeding. If the Run Lock flag is not zero, then some other
processor is modifying the run queue and the first processor must
wait until it is done. To prevent excessive bus traffic caused by
such situations, if a processor finds that a run queue is locked, it
is forced to wait a preselected amount of time (e.g., 0.01
milliseconds) before testing the Run Lock again.

If the Run Lock is equal to zero, the run queue is unlocked and the
process performing the test sets the Run Lock before proceeding to
modify the processor's run queue. When the processor is done with
the run queue, it unlocks the run queue by resetting the Run Lock to
zero.

The Preemption Flag parameter 40f is used to force the processor to
look on its run queue for a process of higher priority than the
currently running process. Normally, the search for a new process is
performed only when the currently running process finishes or
reaches a block (such as a cross processor call) which causes it to
be stopped, at least temporarily. The current process can be
preempted, however, if the Preemption Flag 40f is given a nonzero
value.

Toward the end of the processing of every interrupt which can
interrupt a user process, the Preemption Flag 40f is checked. If the
Preemption Flag is nonzero, a search for a higher priority process
than the currently running process is initiated. If a higher
priority process is found, the current process is stopped and saved,
and the higher priority process is run. In any case, at the end of
the preemption search, the Preemption Flag 40f is automatically
reset to zero.

The set of interrupts which can initiate a preemption search
includes the interrupt generated by a processor when it puts a high
priority process on the run queue of another processor, and a clock
interrupt, which occurs 64 times per second.

Next, the CPUSTATE data structure contains a set of Performance
Statistics 40g which, as described below, are used to help select a
home site for each new process created by the system 20.

The CPUSTATE data structure 40 also contains a list 40h of the free
pages in its physical memory for use by its virtual memory
management system.

Cross Processor Calls

FIG. 4 is a flow chart of the process by which a system subroutine
call may cause a process to be moved from one site to another in a
computer system. This is essentially a more detailed version of FIG.
2A.

For the purpose of explaining FIGS. 4 through 6, the term "the
process" is used to refer to the process which has made a syscall or
other subroutine call which has caused a cross processor call to be
made.

Whenever a syscall (i.e., a system subroutine call) is made the
system first checks to see if the syscall can be run on the current
CPU of the process which made the call (box 70). If so, the syscall
routine is performed locally (box 86) and, assuming that the process
is running on its home CPU (box 88), it returns to the process that
performed the syscall (box 90).

If the syscall cannot be performed locally, the following steps are
performed. The parameter in the process table called Next CPU is set
equal to a pointer to a processor (i.e., to the CPUSTATE data
structure for a processor) which can perform the syscall (box 76).
Then a variable called Old Priority is set equal to the process's
priority (box 78) (which is obtained from the priority entry for the
current process in the process table) so that this value can be
restored after a context switch is performed.
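The Run Lock protocol described for the CPUSTATE data structure (test the flag with an indivisible TAS, wait a preselected time if the queue is locked, and reset the flag to zero when done) can be illustrated with a minimal Python sketch. This is not code from the patent: the names `RunLock`, `lock_run_queue` and `demo` are hypothetical, threads stand in for processors, and a `threading.Lock` is assumed as a stand-in for the hardware guarantee that a TAS cannot be interrupted.

```python
import threading
import time


class RunLock:
    """Run Lock flag for one run queue: 0 means unlocked, nonzero means locked."""

    def __init__(self):
        self.flag = 0
        # Stand-in for hardware indivisibility (an assumption of this sketch).
        self._indivisible = threading.Lock()

    def test_and_set(self):
        """Return the old flag value and set the flag to 1 only if it was 0."""
        with self._indivisible:
            old = self.flag
            if old == 0:
                self.flag = 1
            return old

    def unlock(self):
        """Reset the flag to zero once the run queue update is finished."""
        self.flag = 0


def lock_run_queue(lock, retry_delay=0.00001):
    """Retry the TAS until it succeeds, waiting between attempts."""
    while lock.test_and_set() != 0:
        time.sleep(retry_delay)  # 0.01 ms, the example delay given in the text


def demo(workers=4, rounds=200):
    """Several threads, standing in for processors, update one run queue."""
    lock = RunLock()
    run_queue = []

    def worker(cpu):
        for i in range(rounds):
            lock_run_queue(lock)
            run_queue.append((cpu, i))  # modify the queue only while locked
            lock.unlock()

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(run_queue), lock.flag
```

The wait between attempts is the point of the design: a processor that finds the queue locked does not hammer the shared flag, which on the described hardware would mean repeated bus transactions.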
[illegible] that the process will be serviced as quickly as
possible.

The context switch itself is performed by calling a [illegible]
described below in detail with [illegible] process is now running on
the processor [illegible] preferred embodiment, the Next CPU is
always the main processor MP, but in other embodiments of the
invention the Next CPU could be any processor in the system.

Once the context switch has been performed, the process's original
priority is restored (box 84) and the syscall routine is performed
(86). Then the process is returned to its home site (boxes 88 to
94). This is done by first checking to see if the current CPU is the
process's Home CPU (box 88).

If the Current CPU is the process's Home CPU, no further processing
is required and the routine returns (box 90). Otherwise a context
switch back to the process's Home CPU is performed by setting the
Next CPU parameter equal to the process's Home CPU (box 92), calling
SWITCH (box 94), and then returning (box 90).

In the preferred embodiment of the invention, the restoration
portion of the syscall handling routine, boxes 88 to 94, is kept
simple because (1) there is only one processor to which processes
are sent for handling special functions (i.e., the main processor),
and (2) none of the syscall routines call other syscall routines. As
a result, a process is always returned to its Home CPU after a
syscall is complete.

As will be understood by those skilled in the art, in other
embodiments of the invention a process might "return" to a CPU other
than the process's Home CPU. In such a system a "Return CPU"
parameter would have to be added to the system. In such a system,
box 88 will be replaced by a query regarding whether "Current
CPU=Return CPU?", and if not, the process will be SWITCHed to the
Return CPU, which may or may not be the Home CPU.

FIG. 5 is a flow chart of the SWITCH routine used in the preferred
embodiment of the invention to move a process from one site to
another in a computer system. The first step (box 100) of this
routine is to save the context of the current process by storing its
registers and such in its USER data structure (see FIG. 1).

Then the run queue for the Current CPU is locked (using the RUN LOCK
40e in the CPUSTATE data structure for the Current CPU) (box 104)
and the LAST PROC parameter 40c is set equal to CURRENT PROC 40b.
This reflects the fact that the current process will no longer be
running, and hence will be the last process to have run.

Next the CURRENT PROC parameter 40b is set equal to a pointer to the
process on the run queue with the highest priority, and the CURRENT
PRIORITY parameter 40d is set equal to this new process's priority
(box 108). Then the new CURRENT PROC is removed from the run queue
(by modifying the run queue's linked list using standard programming
techniques) (box 110) and the Current CPU's run queue is unlocked
(box 112) by setting the RUN LOCK parameter 40e to zero. Finally,
the new current process is started up by restoring the process's
context from its USER structure and "resuming" the process (box
114).

[illegible] SWITCH program continues by setting up the NEXT CPU to
run the user process which is undergoing the context switch. These
steps are performed [illegible] otherwise ready to exit (box
[illegible] the SWITCH routine [illegible]

Otherwise, if the process (called LAST PROC) is [illegible] NEXT CPU
is not [illegible] CURRENT CPU) (box 117) the SETRQ routine is
called (box 120) to move the process [illegible] LAST PROC to its
home site. [illegible] SWITCH is being suspended (i.e., being put on
a sleep queue) or is not switching processors for any other reason
(box 117) [illegible] SWITCH routine exits instead of calling SETRQ.
Thus the SWITCH is used not only for context switching, but also for
activating a new process whenever a CPU's currently running process
is suspended.

FIG. 6 is a flow chart of the subroutine SETRQ which is used by the
context switching method diagramed in FIG. 5. The SETRQ routine
receives as a parameter a pointer to a process which needs to be put
onto the run queue of its NEXT CPU, as identified by the process
table 30 (FIG. 1) for that process.

The first step (box 130) of the SETRQ routine is to lock the run
queue of the process's NEXT CPU, using an indivisible TAS
instruction on the Run LOCK for that CPU. As indicated above, if the
Run LOCK for the NEXT CPU is already locked, the system waits for a
short period of time and then tries the TAS instruction again,
repeating this process until the Run LOCK is unlocked by whatever
process was previously using it.

Once the NEXT CPU's run queue is locked, the routine checks to see
if the Current CPU is the same as the NEXT CPU (box 132). If so, the
process is simply added to the run queue of the current (i.e., NEXT)
CPU (box 136) and the run queue for the NEXT CPU is unlocked (box
138) by setting its Run LOCK to zero.

If the current CPU is not the same as the NEXT CPU (box 132) then
the Preemption Flag of the NEXT CPU is set and the NEXT CPU is sent
a software interrupt (box 134). Then the Current CPU parameter is
set equal to NEXT CPU, the process is added to the run queue of the
NEXT CPU (box 136) and the NEXT CPU's run queue is unlocked.

The purpose of setting the Preempt Flag and generating an interrupt
for the NEXT CPU (box 134) is to force the NEXT CPU to service the
process as soon as possible. This combination of steps forces the
NEXT CPU to preempt the currently running process with the process
added to the run queue, which has been given the highest user
priority (see box 108 in FIG. 5). The reason that the interrupt is
sent to the NEXT CPU before the process is added to the NEXT CPU's
run queue is simply to speed up the process of transferring the
process to the NEXT CPU. In effect, the NEXT CPU is forced to begin
the process of looking for a new process as soon as possible, but
will not actually look through its run queue for the highest
priority process
therein until its run queue is unlocked several instruction
cycles later.
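The SETRQ steps described above (boxes 130 to 138) can be sketched in Python as follows. This is a single-threaded illustration under stated assumptions, not the patent's code: the `CPU` and `Process` classes, the `interrupts` counter that stands in for a software interrupt, and the `running_on` argument are all hypothetical. The TAS retry loop of box 130 is reduced to a simple check because nothing else contends for the lock here.

```python
class CPU:
    """Illustrative stand-in for one processor's CPUSTATE data."""

    def __init__(self, name):
        self.name = name
        self.run_lock = 0          # 0 = run queue unlocked
        self.preemption_flag = 0   # nonzero forces a preemption search
        self.interrupts = 0        # counts software interrupts (hypothetical)
        self.run_queue = []


class Process:
    """Illustrative stand-in for one row of the process table."""

    def __init__(self, name, current_cpu, next_cpu):
        self.name = name
        self.current_cpu = current_cpu
        self.next_cpu = next_cpu


def setrq(proc, running_on):
    """Put proc on the run queue of its NEXT CPU."""
    target = proc.next_cpu

    # Box 130: lock the NEXT CPU's run queue. A real system would retry a
    # TAS after a short wait; this sketch only checks that the lock is free.
    if target.run_lock != 0:
        raise RuntimeError("run queue already locked")
    target.run_lock = 1

    # Box 132: is the CPU executing SETRQ the same as the NEXT CPU?
    if target is not running_on:
        # Box 134: set the Preemption Flag and interrupt the NEXT CPU
        # before the process is actually queued.
        target.preemption_flag = 1
        target.interrupts += 1
        proc.current_cpu = target

    # Box 136: add the process to the run queue.
    target.run_queue.append(proc)

    # Box 138: unlock by resetting the Run Lock to zero.
    target.run_lock = 0
```

For example, with `mp = CPU("MP")` and `ap1 = CPU("AP1")`, calling `setrq(Process("p", ap1, mp), ap1)` leaves the process on MP's run queue with MP's Preemption Flag set, whereas a process whose NEXT CPU is the CPU already running SETRQ is queued without any interrupt.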

Process Creation and Home Site Selection 

Referring to FIG. 7, new processes are created in
UNIX systems by a two step process: a fork (box 150)
which duplicates an already existing process, and an
"exec" step (box 152) which replaces the duplicated
process's control program with a program from a specified
file.

In systems incorporating the present invention, process
creation requires an additional step: selection of a
home site for the new process. In the preferred embodiment
the selection of a home site works as follows.

First (box 154), the process inspects the HOMEMASK
parameter for the new process. The HOMEMASK
parameter is loaded into the new process's
USER structure when the process is created from the
parent process at the fork step (box 150). It is a mask that
indicates which CPUs the new process can use as a
home site. In particular, each bit of the HOMESITE
parameter is equal to 1 if the corresponding CPU can be
used as a home site, and is equal to 0 if it cannot.

The remainder of the home site selection process is
restricted to sites allowed by the process's HOMESITE.

Second (box 156), the system calculates the expected
memory usage of the new process for each of the processors
permitted by its HOMESITE parameter. The
process may use less memory in some processors than in
others because it may be able to share code already
resident at those sites.

Third (box 158), the system picks two candidate sites:
(1) the site which would have the most memory left if
the process were located there; and (2) the site with the
smallest average run queue size which also has enough
memory to run the process.

Fourth (boxes 160-164), the system selects as the
process's home site the first candidate if its average
queue size is less than the second candidate's average
queue size plus a preselected quantity QZ (a system
parameter selected by the person setting up the system,
typically equal to 0.5). Otherwise the second candidate
site is selected as the home site.

It should be noted that the average queue size for
every processor is stored in the Performance Statistics
portion 40g of the processor's CPUSTATE data structure,
and is updated by the system's clock routine sixty-four
times per second.

Also, it can be seen that the selection of a home site is 
generally weighted in favor of sites with the most mem- 
ory available. 

Finally (boxes 166-168) the new process is moved to
its new home site (if it isn't already there), by setting its
HOME CPU parameter to the new site, copying its
USER structure and resident memory into the memory
of the home site, and putting the new process on the
home site's run queue by calling SETRQ (not shown as
a separate step in FIG. 7).
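The candidate selection of boxes 154 to 164 can be sketched in Python as shown below. The sketch rests on several assumptions that are not in the text: the `Site` record, the `select_home_site` function name, the representation of the home site mask (called HOMEMASK and HOMESITE in the text) as an integer with bit i standing for CPU i, and the choice to draw the first candidate only from sites that have enough memory. The expected memory usage per site is taken as an input rather than computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    cpu_id: int
    free_memory: int        # memory currently available at this site
    avg_queue_size: float   # from the site's Performance Statistics


def select_home_site(sites, home_mask, expected_usage, qz=0.5):
    """Return the chosen Site, or None if no permitted site has enough memory.

    expected_usage maps cpu_id to the memory the new process is expected
    to need at that site (box 156).
    """
    # Box 154: keep only the sites whose bit is set in the mask.
    allowed = [s for s in sites if (home_mask >> s.cpu_id) & 1]

    # Assumption: only sites with enough memory are considered at all.
    fits = [s for s in allowed if s.free_memory >= expected_usage[s.cpu_id]]
    if not fits:
        return None

    # Box 158, candidate (1): most memory left after placing the process.
    first = max(fits, key=lambda s: s.free_memory - expected_usage[s.cpu_id])
    # Box 158, candidate (2): smallest average run queue with enough memory.
    second = min(fits, key=lambda s: s.avg_queue_size)

    # Boxes 160-164: prefer the memory-rich site unless its queue is
    # longer than the other candidate's by QZ or more.
    if first.avg_queue_size < second.avg_queue_size + qz:
        return first
    return second
```

With sites 0, 1 and 2 permitted, free memory of 100, 400 and 300, expected usage of 50, 50 and 40, and average queue sizes of 2.0, 1.2 and 1.0, the first candidate is site 1 and the second is site 2. Since 1.2 is less than 1.0 + 0.5, site 1 is chosen; with QZ lowered to 0.1 the choice moves to site 2.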

Memory Management 

The UNIX kernel is initially copied from disk into the
main processor's random access memory MEM_MP
whenever the system is restarted. The other processors
initially access the kernel's routines by performing
nonlocal memory accesses over the system composition
bus 24 to MEM_MP.



For purpose of efficient operation, a copy of the
portions of the UNIX kernel used by each application
processor is copied into local memory. This is done so
that each processor can run kernel code without having
to perform nonlocal memory access, which is very slow
compared to local memory access.

The process by which the kernel code is copied to 
local memory is as follows. 

FIG. 8 is a block diagram of a virtual memory management
page table used in the preferred embodiment of
the present invention. The use of page tables by the
memory management units is well known in the prior
art. However, the present invention makes special use
of this table.

For each page entry in the MMU page table there is
a space for storing a real (i.e., physical) memory address,
a "V" flag which indicates if the real memory
address is valid (i.e., it indicates whether the page is
currently stored in resident memory); a read only RO
flag which indicates, if enabled, that the corresponding
page cannot be overwritten; a MOD flag, which is
enabled if the contents of the page have been modified
since the page was first allocated; and a REF flag,
which is enabled if the contents of the page have been
accessed in any way (i.e., either read or written to) since
the page was first allocated.

When the system is first started up, the MMU Page 
Tables in each processor contain entries for all of the 
UNIX kernel code, with real addresses in the main 
processor's memory MEM_MP. Every five seconds, a
special program in the main processor writes into local 
memory a copy of all the kernel pages which have been 
referenced by each applications processor and which 
have not yet been copied into local memory. References 
to these pages are thereafter handled by reading local 
memory rather than the main processor's memory. 
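The periodic kernel copy described above can be sketched in Python as follows. The sketch is illustrative only and makes several assumptions: memories are modeled as dictionaries from address to page contents, `PageEntry`, `is_local`, `local_base` and `copy_referenced_kernel_pages` are hypothetical names, and the five second timer is left to the caller. Only the REF flag and the real address from the page table entry are used.

```python
class PageEntry:
    """One MMU page table entry for a kernel page (simplified)."""

    def __init__(self, real_address):
        self.real_address = real_address  # starts as an address in MEM_MP
        self.valid = True                 # V flag
        self.read_only = True             # RO flag
        self.modified = False             # MOD flag
        self.referenced = False           # REF flag
        self.is_local = False             # hypothetical bookkeeping field


def copy_referenced_kernel_pages(page_table, main_memory, local_memory, local_base):
    """Copy referenced, not yet local kernel pages into local memory.

    Returns the page numbers copied on this pass. Intended to be called
    periodically (every five seconds in the text).
    """
    copied = []
    for page_no, entry in page_table.items():
        if entry.referenced and not entry.is_local:
            # Pick the next free local slot (a simplification).
            local_address = local_base + len(local_memory)
            local_memory[local_address] = main_memory[entry.real_address]
            # Later references resolve to local memory, not MEM_MP.
            entry.real_address = local_address
            entry.is_local = True
            copied.append(page_no)
    return copied
```

A second pass over an unchanged page table copies nothing, which reflects the condition that only pages "not yet copied into local memory" are transferred.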

While the present invention has been described with
reference to a few specific embodiments, the description
is illustrative of the invention and is not to be construed
as limiting the invention. Various modifications
may occur to those skilled in the art without departing
from the true spirit and scope of the invention as defined
by the appended claims.

For instance, in other embodiments of the present 
invention, the system's functions could be distributed in 
such a way that different syscall routines might require 
cross processor calls to a plurality or multiplicity of 
different corresponding processors. 

Also, in some embodiments of the present invention 
there could be a dynamic load balancer which would 
periodically compare the loads on the various proces- 
sors in the system. The load balancer would be pro- 
grammed to select candidate processes for being shifted 
to new home sites, and to transfer a selected process to 
a new home site if the load on its current home site 
becomes much heavier than the load on one or more of 
the other processors in the system. 

What is claimed is: 

1. A computer system, comprising: 

a multiplicity of distinct central processing units
(CPUs), each having a separate, local, random
access memory means to which said CPU has direct
access;

at least one interprocessor bus coupling said CPUs to 
said multiplicity of memory means, so that each 
CPU can access both its own local memory means 
and the memory means of the other CPUs; 
run queue means coupled to said CPUs for holding a
separate run queue for each of said CPUs; each said 
run queue holding a list of the processes waiting to 
run on the corresponding CPU; 

process creation means in at least one of said CPUs
for creating new processes, for assigning one of
said CPUs as the home site of each new process,
and for installing said new process in the local
memory means for said home site; and

cross processor call means in each of said CPUs for
temporarily transferring a specified process from
its home site to another one of said CPUs, for the
purpose of performing a task which cannot be performed
on said home site, said cross processor call
means including means for:

(a) placing said specified process on the run queue 
of said other CPU; 

(b) continuing the execution of said specified process
on said other CPU, using the memory means
for said specified process's home site as the resident
memory for said process and using said
interprocessor bus means to couple said other
CPU to said home site memory means, until a
predefined set of tasks has been completed; and
then

(c) upon completion of said predefined set of tasks,
automatically returning said specified process to
its home site by placing said specified process on
the run queue of said specified process's home
site, so that execution of the process will resume
on said specified process's home site.

2. A computer system as set forth in claim 1, wherein
said random access memory means of a first one of
said CPUs includes kernel means having a predefined
set of software routines for performing predefined
kernel functions;
said computer system further including

memory management means coupled to said random
access memory means of said CPUs, including
table means for denoting which portions of said
kernel means are used by each of said CPUs
other than said first CPU; and
kernel copy means, coupled to said table means, for
periodically copying into the local random access
memory means of each of said other CPUs
said kernel portions denoted in said table means
as used by said CPU but not previously copied
into the local random access memory means of
said CPU;
whereby the use of said interprocessor bus for accessing
said kernel means is reduced by providing copies,
in the local memory means of each CPU, of
those portions of said kernel means actually used
by each CPU.
3. A computer system as set forth in claim 1, wherein
said process creation means includes means for assigning
a home site priority to each new process,
said home site priority being assigned a value
within a predefined range of priority values;
said system further includes process selection means
for selecting a process to run on a specified one of
said CPUs, when the process currently running in
said specified CPU is stopped, by selecting the
process in said run queue for said specified CPU
with the highest priority; and
said cross processor call means further includes
means for



(d) assigning a specified process a higher priority 
than its home site priority when it is added to the 
run queue for a CPU other than its home site; 
and 

(e) resetting the priority for said specified process 
to its home site priority when said process is 
added to the run queue for its home site; 

whereby a process transferred to a CPU other than its
home site is given increased priority to accelerate
selection of the process for running on said other
CPU.

4. A computer system as set forth in claim 3, wherein
at least one of said CPUs includes preemption means for
finding the highest priority process in its run queue, said
preemption means including means for stopping the
process currently running in said CPU, when said highest
priority process has higher priority than the process
currently running in said CPU, and then running said
highest priority process;

at least one of said CPUs includes interrupt means for 
activating said preemption means in another one of 
said CPUs; and 

said cross processor call means further includes 
means for 

(f) using said interrupt means in said specified process's
home site CPU to activate said preemption
means in a specified CPU when said specified process
is added to the run queue for said specified
CPU;

whereby a process transferred to a CPU other than its 
home site will be run immediately if its assigned 
priority is greater than the priorities assigned to the 
process currently running in said other CPU and to 
other processes, if any, in said run queue for said 
other CPU. 

5. A computer system, comprising: 

a multiplicity of distinct central processing units 
(CPUs), each having a separate, local, random 
access memory means to which said CPU has di- 
rect access; said CPUs having the capability of 
executing indivisible read modify write instruc- 
tions; 

at least one interprocessor bus coupling said CPUs to 
all of said memory means, so that each CPU can 
access both its own local memory means and the 
memory means of the other CPUs; 

run queue means coupled to said CPUs for holding a 
separate run queue for each of said CPUs; each said 
run queue holding a list of the processes waiting to 
run on the corresponding CPU; 

a run lock for each said run queue, said run lock 
having a first predefined value to indicate that the 
corresponding run queue is not in the process of 
being modified by any of said CPUs and is un- 
locked, and a value other than said first predefined 
value when the corresponding run queue is being 
modified by one of said CPUs and is therefore 
locked; 

run queue updating means coupled to said CPUs for 
adding or removing a specified process from a 
specified run queue, said run queue updating means 
including means for: 
(a) locking said specified run queue by 
(a,1) using an indivisible read modify write instruction
to test the value of the run lock for
said specified run queue and, if said run lock
value indicates that said specified run queue is
unlocked, to set said run lock to a value which
indicates that said specified run queue is
locked; and

(a,2) if the test in step (a,1) determines that said
run queue is locked, performing step (a,1)
again after a predefined delay, until the test in
step (a,1) determines that said run queue is
unlocked;

(b) adding or removing a specified process from
said specified run queue; and

(c) unlocking said specified run queue by setting
said run lock for said specified run queue to said
first predefined value;

process creation means in at least one of said CPUs
for creating new processes, for assigning one of
said CPUs as the home site of each new process,
and for installing said new process in the local
memory means for said home site; and

cross processor call means in each of said CPUs for
temporarily transferring a specified process from
its home site to a specified one of said other CPUs,
for the purpose of performing a task which cannot
be performed on said home site, said cross processor
call means including means for:

(a) using said run queue updating means to add said
specified process to the run queue of said speci-
fied CPU;

(b) continuing the execution of said specified pro-
cess on said specified CPU, using the memory
means for said specified process's home site as
the resident memory for said process and using
said interprocessor bus means to couple said
specified CPU to said home site memory means,
until a predefined set of tasks has been com-
pleted; and then

(c) upon completion of said predefined set of tasks,
automatically returning said specified process to
its home site by using said run queue updating
means to add said specified process to the run
queue of said specified process's home site, so
that execution of the process will resume on said
specified process's home site.
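
For illustration only, the run queue updating steps (a) through (c) recited in claim 5 can be sketched as follows. This is a minimal sketch and not part of the claims. The names RunQueue, test_and_set and update_run_queue are hypothetical, the values 0 and 1 for the run lock are assumed, and an internal Python lock merely stands in for the hardware's indivisible read modify write instruction.

```python
import threading
import time

UNLOCKED = 0  # assumed "first predefined value" of the run lock
LOCKED = 1    # assumed value meaning "being modified"


class RunQueue:
    """Hypothetical per-CPU run queue guarded by a run lock."""

    def __init__(self):
        self.run_lock = UNLOCKED
        self.processes = []
        # Stand-in for hardware atomicity; it only makes test_and_set indivisible.
        self._atomic = threading.Lock()

    def test_and_set(self):
        """Model of an indivisible read modify write: return the old lock
        value and leave the lock set to LOCKED."""
        with self._atomic:
            old = self.run_lock
            self.run_lock = LOCKED
            return old


def update_run_queue(queue, process, add=True, delay=0.001):
    """Add or remove a process following steps (a), (b) and (c)."""
    # Steps (a,1) and (a,2): test the lock, retrying after a delay while locked.
    while queue.test_and_set() != UNLOCKED:
        time.sleep(delay)
    try:
        # Step (b): modify the run queue while holding the run lock.
        if add:
            queue.processes.append(process)
        else:
            queue.processes.remove(process)
    finally:
        # Step (c): unlock by restoring the first predefined value.
        queue.run_lock = UNLOCKED
```

The delay between retries corresponds to the "predefined delay" of step (a,2); its length here is an arbitrary choice.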

6. A computer system as set forth in claim 5, wherein
said process creation means includes means for as-
signing a home site priority to each new process,
said home site priority being assigned a value
within a predefined range of priority values;

said system further includes process selection means
for selecting a process to run on a specified one of
said CPUs, when the process currently running in
said specified CPU is stopped, by selecting the
process in said run queue for said specified CPU
with the highest priority; and

said cross processor call means further includes 
means for 

(d) assigning a specified process a higher priority
than its home site priority when it is added to the
run queue for a CPU other than its home site;
and

(e) resetting the priority for said specified process to
its home site priority when said process is added to
the run queue for its home site;

whereby a process transferred to a CPU other than its 
home site is given increased priority to accelerate 
selection of the process for running on said other 
CPU.
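
For illustration only, the priority handling of steps (d) and (e) and the process selection means of claim 6 can be sketched as follows. This is a minimal sketch and not part of the claims. The names Process, enqueue, select_next and BOOST are hypothetical. It assumes that a larger number means a higher priority and that the increase is a fixed amount; the claim requires only a priority higher than the home site priority. Run queue locking is omitted for brevity.

```python
from dataclasses import dataclass

BOOST = 10  # hypothetical increase applied away from the home site


@dataclass
class Process:
    name: str
    home_cpu: int
    home_priority: int
    priority: int = 0
    current_cpu: int = -1

    def __post_init__(self):
        # A new process starts on its home site at its home site priority.
        self.priority = self.home_priority
        self.current_cpu = self.home_cpu


def enqueue(run_queues, proc, cpu):
    """Add proc to the run queue of cpu, adjusting its priority."""
    if cpu == proc.home_cpu:
        proc.priority = proc.home_priority          # step (e): reset
    else:
        proc.priority = proc.home_priority + BOOST  # step (d): raise
    proc.current_cpu = cpu
    run_queues[cpu].append(proc)


def select_next(run_queues, cpu):
    """Remove and return the highest priority process queued for cpu."""
    queue = run_queues[cpu]
    if not queue:
        return None
    best = max(queue, key=lambda p: p.priority)
    queue.remove(best)
    return best
```

With two processes of equal home site priority queued on the same CPU, the one that is away from its home site is selected first, which is the effect described in the whereby clause.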

7. A computer system as set forth in claim 6, wherein 
at least one of said CPUs includes preemption means
for finding the highest priority process in its run
queue, said preemption means including means for
stopping the process currently running in said CPU,
when said highest priority process has higher prior-
ity than the process currently running in said CPU,
and then running said highest priority process;

at least one of said CPUs includes interrupt means for 
activating said preemption means in another one of 
said CPUs; and 

said cross processor call means further includes 
means for 

(f) using said interrupt means in said specified pro-
cess's home site CPU to activate said preemption
means in a specified CPU when said specified
process is added to the run queue for said speci-
fied CPU;

whereby a process transferred to a CPU other than its 
home site will be run immediately if its assigned 
priority is greater than the priorities assigned to the 
process currently running in said other CPU and to 
other processes, if any, in said run queue for said 
other CPU. 

8. A computer system, comprising: 

a multiplicity of distinct central processing units 
(CPUs), each having a separate, local, random 
access memory means to which said CPU has di- 
rect access; said CPUs having the capability of 
executing indivisible read modify write instruc- 
tions; 

at least one interprocessor bus coupling said CPUs to 
all of said memory means, so that each CPU can 
access both its own local memory means and the 
memory means of the other CPUs; 

process creation means coupled to said CPUs for
creating new processes, for assigning one of said 
CPUs as the home site of each new process, for 
installing said new process in the local memory 
means for said home site, and for assigning a home 
site priority to each new process, said home site 
priority being assigned a value within a predefined 
range of priority values; 

run queue means coupled to said CPUs for holding a 
separate run queue for each of said CPUs; each said 
run queue holding a list of the processes waiting to 
run on the corresponding CPU;

process table means coupled to said CPUs for retain- 
ing information regarding every process running or 
otherwise in existence in said system, including for 
each said process 

a HOME CPU parameter which indicates the
home site of said process;

a CURRENT CPU parameter which indicates the
current CPU on which said process is running,
waiting to run, or otherwise residing; and

a PRIORITY parameter indicative of the priority
of said process;

cpustate table means for storing information regard-
ing each said CPU, including:

a run queue header identifying the run queue for
said CPU;

a current process parameter identifying the process 
currently running in said CPU; 

a last process parameter identifying the process 
which was run prior to the process currently 
running in said CPU; and 

a run lock parameter which is given a first prede-
fined value to indicate that the corresponding
run queue is not in the process of being modified
by any of said CPUs and is unlocked, and a value
other than said first predefined value when the
corresponding run queue is being modified by
one of said CPUs and is therefore locked; and

run queue updating means coupled to said CPUs for
adding or removing a specified process from a
specified run queue, said run queue updating means
including means for:

(a) locking said specified run queue by

(a,1) using said indivisible read modify write
instruction to test the value of the run lock for
said specified run queue and, if said run lock
value indicates that said specified run queue is
unlocked, to set said run lock to a value which
indicates that said specified run queue is
locked; and

(a,2) if the test in step (a,1) determines that said
run queue is locked, performing step (a,1)
again after a predefined delay, until the test in
step (a,1) determines that said run queue is
unlocked;

(b) adding or removing a specified process from 
said specified run queue; 

(c) unlocking said specified run queue by setting 
said run lock for said specified run queue to said 
first predefined value; and 

(d) updating said run queue header, current process 
and last process parameters of the cpustate table 
means for the CPU corresponding to said speci- 
fied run queue to reflect the current status of said 
CPU; 

process selection means in at least one of said CPUs
for selecting a process to run on a specified one of
said CPUs when the process currently running in
said specified CPU is stopped, including means for
selecting the process in said run queue for said
specified CPU with the highest priority and means
for initiating the running of said selected process in
said specified CPU; and

preemption means in each CPU for finding the high-
est priority process in its run queue, said preemp-
tion means including means for stopping the pro-
cess currently running in said CPU, when said
highest priority process has higher priority than
the process currently running in said CPU, and
then running said highest priority process;

interrupt means in each said CPU for activating said
preemption means in a specified one of the other
CPUs; and

cross processor call means in each of said CPUs for
temporarily transferring a specified process from
its home site to a specified one of said other CPUs,
for the purpose of performing a task which cannot
be performed on said home site, said cross proces-
sor call means including means for:

(a) using said run queue updating means to add said 
specified process to the run queue of said speci- 
fied CPU; 

(b) assigning said specified process a higher prior-
ity than its home site priority when it is added to
said run queue for said specified other CPU;

(c) using said interrupt means in said specified pro-
cess's home site CPU to activate said preemption
means in said specified CPU when said specified
process is added to said run queue for said speci-
fied CPU, so that said specified process will
preempt the process currently running in said
specified CPU;

(d) continuing the execution of said specified pro-
cess on said specified CPU, using the memory
means for said specified process's home site as
the resident memory for said process and using
said interprocessor bus means to couple said
specified CPU to said home site memory means,
until a predefined set of tasks has been com-
pleted; and then

(e) upon completion of said predefined set of tasks, 
automatically returning said specified process to 
its home site by using said run queue updating 
means to add said specified process to the run 
queue of said specified process's home site, so 
that execution of the process will resume on said 
specified process's home site; and 

(f) resetting the priority for said specified process 
to its home site priority when said process is 
added to said run queue for its home site. 

9. A method of running a multiplicity of processes in
a computer system, comprising the steps of:
providing a computer system having 
(1) a multiplicity of distinct central processing units 
(CPUs), each having a separate, local, random 
access memory means to which said CPU has 
direct access; said CPUs having the capability of 
executing indivisible read modify write instruc- 
tions; and 

(2) at least one interprocessor bus coupling said CPUs 
to all of said memory means, so that each CPU can 
access both its own local memory means and the 
memory means of the other CPUs; 
wherein at least one of said CPUs is capable of
performing one or more tasks that at least one of
said other CPUs cannot perform;

generating a run queue data structure for holding a
separate run queue for each of said CPUs; each said 
run queue holding a list of the processes waiting to 
run on the corresponding CPU; 

providing run queue updating means for adding or 
removing a specified process from a specified run 
queue; 

creating new processes, as the need arises, including 
the step of assigning one of said CPUs as the home 
site of each new process, and installing said new 
process in the local memory means for said home 
site; and 

when one of said processes needs to perform a task 
which cannot be performed on its home site, per- 
forming a cross processor call to temporarily trans- 
fer said process from its home site to a specified one 
of said other CPUs which is able to perform said 
task, by performing the steps of:

(a) using said run queue updating means to add said 
process to the run queue of said specified CPU; 

(b) continuing the execution of said process on said 
specified CPU, using the memory means for said 
process's home site as the resident memory for 
said process and using said interprocessor bus 
means to couple said specified CPU to said home 
site memory means, until a predefined set of tasks 
has been completed; and then 

(c) upon completion of said predefined set of tasks,
automatically returning said specified process to 
its home site by using said run queue updating 
means to add said process to the run queue of 
said process's home site, so that execution of the 
process will resume on said process's home site.

10. A method as set forth in claim 9, further including
the step of generating a run lock flag for each said run
queue, said run lock having a first predefined value to
indicate that the corresponding run queue is not in the
process of being modified by any of said CPUs and is
unlocked, and a value other than said first predefined
value when the corresponding run queue is being modi-
fied by one of said CPUs and is therefore locked;

wherein said step of providing run queue updating 
means includes providing run queue updating 
means for adding or removing a specified process 
from a specified run queue by performing the steps 
of: 

(a) locking said specified run queue by
(a,1) using said indivisible read modify write
instruction to test the value of the run lock for
said specified run queue and, if said run lock
value indicates that said specified run queue is
unlocked, to set said run lock to a value which
indicates that said specified run queue is
locked; and

(a,2) if the test in step (a,1) determines that said
run queue is locked, performing step (a,1)
again after a predefined delay, until the test in
step (a,1) determines that said run queue is
unlocked;

(b) adding or removing a specified process from 
said specified run queue; and 

(c) unlocking said specified run queue by setting 
said run lock for said specified run queue to said 
first predefined value.

11. A method as set forth in claim 9, wherein said step
of creating new processes includes assigning a home site
priority to each new process, said home site priority
being assigned a value within a predefined range of
priority values;

said method further including the step of selecting a 
process to run on a specified one of said CPUs,
when the process currently running in said speci- 
fied CPU is stopped, by selecting the process in 
said run queue for said specified CPU with the 
highest priority; and 

said step of performing a cross processor call further
includes the steps of

(d) assigning a specified process a higher priority
than its home site priority when it is added to the
run queue for a CPU other than its home site;
and

(e) resetting the priority for said specified process to
its home site priority when said process is added to
the run queue for its home site;

whereby a process transferred to a CPU other than its
home site is given increased priority to accelerate 
selection of the process for running on said other 
CPU. 

12. A method as set forth in claim 11, wherein said 
step of performing a cross processor call further in- 
cludes the step of 

(f) generating a preemption interrupt in said speci-
fied CPU when said specified process is added to
the run queue for said specified CPU;

said method further including the step of responding
to a preemption interrupt in a specified CPU by:

(a) finding the highest priority process in the run 
queue of said specified CPU; and 

(b) stopping the process currently running in said 
CPU, when said highest priority process has 
higher priority than the process currently run- 
ning in said CPU, and then running said highest 
priority process; 

whereby a process transferred to a CPU other than its 
home site will be run immediately if its assigned 
priority is greater than the priorities assigned to the 
process currently running in said other CPU and to 
other processes, if any, in said run queue for said 
other CPU. 
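
For illustration only, the response to a preemption interrupt recited in steps (a) and (b) of claim 12 can be sketched as follows. This is a minimal sketch and not part of the claims. The name handle_preemption_interrupt and the dictionary layout are hypothetical, and a larger number is assumed to mean a higher priority. Placing the stopped process back on the run queue is also an assumption, since the claim says only that the process is stopped.

```python
def handle_preemption_interrupt(run_queue, current):
    """Return the process that should run after a preemption interrupt.

    run_queue is a list of dicts with "name" and "priority" keys;
    current is the running process (same layout) or None if the CPU is idle.
    """
    if not run_queue:
        return current
    # Step (a): find the highest priority process in the run queue.
    best = max(run_queue, key=lambda p: p["priority"])
    # Step (b): preempt only if that process outranks the running one.
    if current is None or best["priority"] > current["priority"]:
        run_queue.remove(best)
        if current is not None:
            run_queue.append(current)  # assumption: stopped process is requeued
        return best
    return current
```

When no queued process outranks the running one, the function returns the running process unchanged, so the interrupt has no effect.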

13. A method as set forth in claim 9, further including 
the steps of: 

providing a predefined set of kernel routines in said
local memory means of a first one of said CPUs; 

denoting, in a predefined data structure, which of said 
kernel routines are used by each of said CPUs other 
than said first CPU; and 

periodically copying into the local random access 
memory means of each of said other CPUs said 
kernel routines denoted in said predefined data 
structure as used by said CPU but not previously 
copied into the local random access memory means 
of said CPU; 

whereby the use of said interprocessor bus for access- 
ing said kernel routines is reduced by providing 
copies, in the local memory means of each CPU, of 
those kernel routines actually used by each CPU. 
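
For illustration only, one periodic copying pass of the kind recited in claim 13 can be sketched as follows. This is a minimal sketch and not part of the claims. The name copy_used_kernel_routines and the dictionary-based data structures are hypothetical stand-ins for the "predefined data structure" and the local memory means.

```python
def copy_used_kernel_routines(master, used_by, local_copies):
    """Perform one periodic copying pass.

    master:       routine name -> routine body held by the first CPU
    used_by:      CPU id -> set of routine names that CPU has used
    local_copies: CPU id -> routines already copied to that CPU (updated in place)
    Returns a dict mapping each CPU id to the names copied in this pass.
    """
    copied = {}
    for cpu, used in used_by.items():
        local = local_copies.setdefault(cpu, {})
        # Copy only routines that are used but not previously copied.
        new_names = [name for name in sorted(used) if name not in local]
        for name in new_names:
            local[name] = master[name]
        copied[cpu] = new_names
    return copied
```

Calling the function again after a CPU uses a further routine copies only that routine, so routines already held locally are never fetched over the interprocessor bus a second time.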
